[jira] [Resolved] (OAK-6534) Compute indexPaths from index definitions json
[ https://issues.apache.org/jira/browse/OAK-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra resolved OAK-6534.
--
Resolution: Fixed
Fix Version/s: 1.7.6

Done with r1804632

> Compute indexPaths from index definitions json
> --
>
> Key: OAK-6534
> URL: https://issues.apache.org/jira/browse/OAK-6534
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Minor
> Fix For: 1.8, 1.7.6
>
> Currently, while adding/updating indexes via {{--index-definitions-file}} (OAK-6471), the index paths are always determined by the {{--index-paths}} option. If there are more index definitions present in the json file, those would be ignored.
> To avoid confusion, the following approach should be implemented:
> * If {{--index-paths}} is specified, use that
> * If not, and {{--index-definitions-file}} is provided, compute the index paths from that
> * If both are specified, merge the two sets, as the user may want to reindex a few indexes and also update a few others

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Updated] (OAK-6534) Compute indexPaths from index definitions json
[ https://issues.apache.org/jira/browse/OAK-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-6534:
-
Description:
Currently, while adding/updating indexes via {{--index-definitions-file}} (OAK-6471), the index paths are always determined by the {{--index-paths}} option. If there are more index definitions present in the json file, those would be ignored.
To avoid confusion, the following approach should be implemented:
* If {{--index-paths}} is specified, use that
* If not, and {{--index-definitions-file}} is provided, compute the index paths from that
* If both are specified, merge the two sets, as the user may want to reindex a few indexes and also update a few others

was:
Currently, while adding/updating indexes via {{--index-definitions-file}} (OAK-6471), the index paths are always determined by the {{--index-paths}} option. If there are more index definitions present in the json file, those would be ignored.
To avoid confusion, the following approach should be implemented:
* If {{--index-paths}} is specified, use that
* If not, and {{--index-definitions-file}} is provided, compute the index paths from that
* If both are specified, {{--index-paths}} takes precedence (no merging done)
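The merge rule described in the bullets above can be sketched as follows. This is an illustrative sketch only; the method and class names are made up and do not reflect the actual oak-run implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the proposed index-path resolution rule. The names here
// (effectiveIndexPaths, IndexPathsMerge) are hypothetical.
public class IndexPathsMerge {
    static Set<String> effectiveIndexPaths(Set<String> cliPaths, Set<String> jsonPaths) {
        Set<String> result = new LinkedHashSet<>();
        if (!cliPaths.isEmpty() && !jsonPaths.isEmpty()) {
            // Both --index-paths and --index-definitions-file given:
            // merge, so a user can reindex some indexes while updating
            // others from the definitions file.
            result.addAll(cliPaths);
            result.addAll(jsonPaths);
        } else if (!cliPaths.isEmpty()) {
            result.addAll(cliPaths);   // only --index-paths given
        } else {
            result.addAll(jsonPaths);  // only --index-definitions-file given
        }
        return result;
    }
}
```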
[jira] [Resolved] (OAK-6541) While importing new index property indexes are getting marked for reindex
[ https://issues.apache.org/jira/browse/OAK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra resolved OAK-6541.
--
Resolution: Fixed
Fix Version/s: 1.7.6

Fixed with r1804631

> While importing new index property indexes are getting marked for reindex
> -
>
> Key: OAK-6541
> URL: https://issues.apache.org/jira/browse/OAK-6541
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: run
> Affects Versions: 1.7.5
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Priority: Minor
> Fix For: 1.8, 1.7.6
>
> OAK-6471 added support for adding new indexes. While doing that, it's being seen that non-Lucene indexes are getting marked for reindex.
[jira] [Created] (OAK-6541) While importing new index property indexes are getting marked for reindex
Chetan Mehrotra created OAK-6541:

Summary: While importing new index property indexes are getting marked for reindex
Key: OAK-6541
URL: https://issues.apache.org/jira/browse/OAK-6541
Project: Jackrabbit Oak
Issue Type: Bug
Components: run
Affects Versions: 1.7.5
Reporter: Chetan Mehrotra
Assignee: Chetan Mehrotra
Priority: Minor
Fix For: 1.8

OAK-6471 added support for adding new indexes. While doing that, it's being seen that non-Lucene indexes are getting marked for reindex.
[jira] [Resolved] (OAK-6504) Active deletion of blobs needs to indicate information about purged blobs to mark-sweep collector
[ https://issues.apache.org/jira/browse/OAK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amit Jain resolved OAK-6504.
-
Resolution: Fixed

Incorporated the review suggestion; done with r1804626, r1804628

> Active deletion of blobs needs to indicate information about purged blobs to mark-sweep collector
> -
>
> Key: OAK-6504
> URL: https://issues.apache.org/jira/browse/OAK-6504
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: lucene
> Affects Versions: 1.7.1
> Reporter: Vikas Saurabh
> Assignee: Amit Jain
> Priority: Minor
> Fix For: 1.8, 1.7.6
>
> Attachments: OAK_6504.patch
>
> The mark-sweep blob collector (since 1.6) tracks blobs in the store. Active purge of Lucene index blobs doesn't update these tracked blobs, which leads the mark-sweep collector to attempt to delete those blobs again.
[jira] [Commented] (OAK-6497) Support old Segment NodeStore setups for oak-run index tooling
[ https://issues.apache.org/jira/browse/OAK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16121009#comment-16121009 ]

Chetan Mehrotra commented on OAK-6497:
--
With r1804624, added support to fall back to the older oak-segment in case of InvalidFileStoreVersionException. With this the user need not specify the {{--segment}} option explicitly and the tooling would take care of that.

> Support old Segment NodeStore setups for oak-run index tooling
> --
>
> Key: OAK-6497
> URL: https://issues.apache.org/jira/browse/OAK-6497
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: run
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Fix For: 1.8, 1.7.6
>
> Attachments: OAK-6497-v1.patch
>
> The oak-run index command has been introduced in trunk and can be used in read-only mode against existing setups. This would work fine for all DocumentNodeStore setups. However, it would not work for SegmentNodeStore setups <= Oak 1.4.
> This task is meant to figure out possible approaches for enabling such support for oak-run builds from trunk.
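The fallback behaviour described in the comment above can be sketched in outline. Everything here is a stand-in: the store-opening methods, the {{Store}} interface, and the local exception class are hypothetical, not the actual oak-run or oak-segment APIs.

```java
// Sketch of the "try new format, fall back to old on version mismatch"
// pattern, so the user need not pass --segment. All names are
// illustrative stand-ins, not Oak's actual classes.
public class SegmentStoreFallback {
    static class InvalidFileStoreVersionException extends Exception {}

    interface Store { String format(); }

    static Store openNewFormat(boolean oldFormatOnDisk) throws InvalidFileStoreVersionException {
        if (oldFormatOnDisk) throw new InvalidFileStoreVersionException();
        return () -> "oak-segment-tar";
    }

    static Store openOldFormat() { return () -> "oak-segment"; }

    // Try the current oak-segment-tar format first; on a version
    // mismatch, transparently fall back to the older oak-segment.
    static Store open(boolean oldFormatOnDisk) {
        try {
            return openNewFormat(oldFormatOnDisk);
        } catch (InvalidFileStoreVersionException e) {
            return openOldFormat();
        }
    }
}
```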
[jira] [Commented] (OAK-937) Query engine index selection tweaks: shortcut and hint
[ https://issues.apache.org/jira/browse/OAK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120994#comment-16120994 ]

Chetan Mehrotra commented on OAK-937:
-
bq. For example, each index can have a multi-valued property "tags". Then a query can specify "option(index tag )".

+1. This allows customers to bind to a specific index or enables the QE to select from a set of indexes.

[~catholicon] Regarding the aggregate - there are other cases also, like custom synonyms or analyzers configured for the same nodetype. So it's best to do selection at the index level instead.

> Query engine index selection tweaks: shortcut and hint
> --
>
> Key: OAK-937
> URL: https://issues.apache.org/jira/browse/OAK-937
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: query
> Reporter: Alex Deparvu
> Assignee: Thomas Mueller
> Priority: Critical
> Labels: performance
> Fix For: 1.8
>
> This issue covers 2 different changes related to the way the QueryEngine selects a query index:
> Firstly, there could be a way to end the index selection process early via a known constant value: if an index returns a known value token (like -1000) then the query engine would effectively stop iterating through the existing index impls and use that index directly.
> Secondly, it would be nice to be able to specify a desired index (if one is known to perform better), thus skipping the existing selection mechanism (cost calculation and comparison). This could be done via certain query hints [0].
> [0] http://en.wikipedia.org/wiki/Hint_(SQL)
[jira] [Commented] (OAK-937) Query engine index selection tweaks: shortcut and hint
[ https://issues.apache.org/jira/browse/OAK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120750#comment-16120750 ]

Vikas Saurabh commented on OAK-937:
---
While I like the idea of providing tag-based index hints (a minor improvement could be to pick a set of tags - "option(index tag ,")"); but
bq. The main problem I want to address with this issue is: there are multiple Lucene index configurations, with different aggregation rules.
I think this particular problem might be solved by doing indirection inside the index def itself, e.g.
{noformat}
+ /aggregates//
  + useCase1/
    - oak:aggregateClassifier = true
    +
  + useCase2/
    - oak:aggregateClassifier = true
    +
{noformat}
... and extend the {{contains()}} clause to potentially choose nothing (all aggregates participate) or a subset of classifiers. The reason I'd want to solve multiple use-cases of aggregation/nodeScopeIndex this way is to still hold the convention that we have one index for a particular type - that, imo, makes people think more about index design and also provides a clearer view right away from index definitions (yes, the tag approach would also work... but to me humans are worse at indirection than computers).
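The "shortcut" idea in the issue description above can be sketched as a selection loop with a sentinel cost. The interface and the sentinel value are illustrative assumptions, not Oak's actual QueryIndex API.

```java
import java.util.List;

// Sketch of the proposed selection shortcut: a sentinel cost value lets
// one index end the cost-comparison loop early. The Index interface and
// SHORTCUT constant are assumptions for illustration only.
public class IndexSelection {
    static final double SHORTCUT = -1000;

    interface Index {
        String name();
        double cost();
    }

    static Index select(List<Index> indexes) {
        Index best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Index ix : indexes) {
            double c = ix.cost();
            if (c == SHORTCUT) {
                // Shortcut: stop iterating through the remaining index
                // impls and use this index directly.
                return ix;
            }
            if (c < bestCost) {
                bestCost = c;
                best = ix;
            }
        }
        return best; // normal path: cheapest index wins
    }
}
```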
[jira] [Commented] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
[ https://issues.apache.org/jira/browse/OAK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120127#comment-16120127 ]

angela commented on OAK-6540:
-
[~rombert], IMHO it has nothing to do with the security component, as the read-only status is not defined by means of security. What I would suggest though is to use {{Session.hasCapability}} for that matter... this is exactly what you are looking for from a JCR API point of view :-) See https://docs.adobe.com/docs/en/spec/jcr/2.0/9_Permissions_and_Capabilities.html

> Session.hasAccess(...) should reflect read-only status of mounts
>
> Key: OAK-6540
> URL: https://issues.apache.org/jira/browse/OAK-6540
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: composite, security
> Reporter: Robert Munteanu
> Fix For: 1.8, 1.7.6
>
> When a mount is set in read-only mode, callers that check {{Session.hasPermission("set_property", ...)}} or {{Session.hasPermission("add_node", ...)}} for mounted paths will believe that they are able to write under those paths. For a composite setup with a read-only mount this should (IMO) reflect that callers are not able to write, taking into account the mount information on top of the ACEs.
> [~anchela], [~stillalex] - WDYT?
[jira] [Commented] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
[ https://issues.apache.org/jira/browse/OAK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120113#comment-16120113 ]

Robert Munteanu commented on OAK-6540:
--
[~anchela] - thanks for the quick reply. Do you see a way of surfacing this read-only status from the POV of the security component? I'd like to avoid binding clients to the {{spi.mount}} package.
[jira] [Commented] (OAK-6539) Decrease version export for org.apache.jackrabbit.oak.spi.security.authentication
[ https://issues.apache.org/jira/browse/OAK-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120101#comment-16120101 ]

Robert Munteanu commented on OAK-6539:
--
Are there {{@ProviderType}} interfaces exposed by this package? If so, I think it's unsafe to change the version back. The reason is that if a package implements a {{@ProviderType}} interface from this package it would import {{[1.3.0,1.4.0)}}. If we move the version back to {{1.2.0}} then the imports would no longer resolve.
On the other hand, if this version was not included in a release we can revert it.

> Decrease version export for org.apache.jackrabbit.oak.spi.security.authentication
> -
>
> Key: OAK-6539
> URL: https://issues.apache.org/jira/browse/OAK-6539
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: core, security
> Reporter: Alex Deparvu
> Assignee: Alex Deparvu
> Priority: Trivial
>
> There's a warning when building oak-core related to the export version for the org.apache.jackrabbit.oak.spi.security.authentication package:
> {noformat}
> [WARNING] org.apache.jackrabbit.oak.spi.security.authentication: Excessive version increase; detected 1.3.0, suggested 1.2.0
> {noformat}
> I see no reason to not decrease the version. [~anchela], thoughts?
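The version math in the comment above can be made concrete with a sketch of the OSGi manifest headers involved. The package name is from the discussion; the exact header layout and the consuming bundle are illustrative.

```
# Provider of the package, after the version bump:
Export-Package: org.apache.jackrabbit.oak.spi.security.authentication;version="1.3.0"

# A bundle that *implements* a @ProviderType interface from the package
# gets the narrow provider-policy import range at build time:
Import-Package: org.apache.jackrabbit.oak.spi.security.authentication;version="[1.3.0,1.4.0)"

# If the export is later moved back to 1.2.0, that import no longer
# resolves, because 1.2.0 lies outside [1.3.0,1.4.0).
```

This is why the decrease is only safe if no released bundle has already built against 1.3.0.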
[jira] [Resolved] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
[ https://issues.apache.org/jira/browse/OAK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela resolved OAK-6540.
-
Resolution: Invalid
[jira] [Updated] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
[ https://issues.apache.org/jira/browse/OAK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-6540:

Component/s: security
[jira] [Commented] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
[ https://issues.apache.org/jira/browse/OAK-6540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120090#comment-16120090 ]

angela commented on OAK-6540:
-
[~rombert], I don't think that this would be correct, as the read-only status has nothing to do with permission evaluation. The read-only status of a mount is rather like the read-only status of the version storage, which isn't reflected in {{Session.hasPermission}} either.
[jira] [Commented] (OAK-6539) Decrease version export for org.apache.jackrabbit.oak.spi.security.authentication
[ https://issues.apache.org/jira/browse/OAK-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120087#comment-16120087 ]

angela commented on OAK-6539:
-
[~stillalex], not that I was aware of... I remember that I once had a major version bump and [~rombert] fixed that by adding the provider type annotation... but I wasn't aware of that warning. Feel free to fix it; removing a warning is always good! Thanks for spotting.
[jira] [Commented] (OAK-6504) Active deletion of blobs needs to indicate information about purged blobs to mark-sweep collector
[ https://issues.apache.org/jira/browse/OAK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120081#comment-16120081 ]

Vikas Saurabh commented on OAK-6504:
-
[~amitjain], the fix looks good to me. A minor nitpick though - I think the temp file to track deleted blobs should be created in the {{rootDirectory}} passed onto {{ActiveDeletedBlobCollectorFactory}}.
For the test, I have OAK-6334 on my plate. I'd try to refactor those later. For now, extracting out and creating the new class looks fine to me.
[jira] [Created] (OAK-6540) Session.hasAccess(...) should reflect read-only status of mounts
Robert Munteanu created OAK-6540:

Summary: Session.hasAccess(...) should reflect read-only status of mounts
Key: OAK-6540
URL: https://issues.apache.org/jira/browse/OAK-6540
Project: Jackrabbit Oak
Issue Type: Improvement
Components: composite
Reporter: Robert Munteanu
Fix For: 1.8, 1.7.6

When a mount is set in read-only mode callers that check {{Session.hasPermission("set_property", ...)}} or {{Session.hasPermission("add_node", ...)}} for mounted paths will believe that they are able to write under those paths. For a composite setup with a read-only mount this should (IMO) reflect that callers are not able to write, taking into account the mount information on top of the ACEs.
[~anchela], [~stillalex] - WDYT?
[jira] [Updated] (OAK-6513) Journal based Async Indexer
[ https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-6513:
-
Description:
The current async indexer design is based on NodeState diff. This has served us fine so far; however, of late it is not able to perform well if the rate of repository writes is high. When changes happen faster than index-update can process them, larger and larger diffs will happen. These make index-updates slower, which again leads to the next diff being ever larger than the one before (assuming a constant ingestion rate).

In the current diff-based flow, the indexer performs a complete diff for all changes happening between 2 cycles. It may happen that lots of writes happen but not much indexable content is written, so doing the diff there is a wasted effort.

In the 1.6 release, for NRT indexing, we implemented journal-based indexing for external changes (OAK-4808, OAK-5430). That approach can be generalized and used for async indexing. Before talking about the journal-based approach, let's see how an IndexEditor works currently.

h4. IndexEditor

Currently any IndexEditor performs 2 tasks:
# Identify which node is to be indexed based on some index definition. The Editor gets invoked as part of content diff, where it determines which NodeState is to be indexed
# Update the index based on the node to be indexed

For e.g. in oak-lucene we have LuceneIndexEditor, which identifies the NodeStates to be indexed, and LuceneDocumentMaker, which constructs the Lucene Document from the NodeState to be indexed. For the journal-based approach we can decouple these 2 parts and thus have:
* IndexEditor - Identifies which all paths need to be indexed for a given index definition
* IndexUpdater - Updates the index based on a given NodeState and its path

h4. High Level Flow

# Session Commit Flow
## Each index type would provide an IndexEditor which would be invoked as part of commit (like sync indexes). These IndexEditors would just determine which paths need to be indexed.
## As part of commit, the paths to be indexed would be written to the journal.
# AsyncIndexUpdate flow
## AsyncIndexUpdate would query this journal to fetch all such indexed paths between the 2 checkpoints
## Based on the index path data, it would invoke the {{IndexUpdater}} to update the index for that path
## Merge the index updates

h4. Benefits

Such a design would have the following impact:
# More work done as part of write
# Marking of indexable content is distributed, hence at indexing time there is less work to be done
# Indexing can progress in batches
# The indexers can be called in parallel

h4. Journal Implementation

DocumentNodeStore currently has a built-in journal which is being used for NRT indexing. That feature can be exposed as an API. For scaling indexing, this design is mostly required for the cluster case. So we can possibly have both indexing supports implemented and use the journal-based support for DocumentNodeStore setups. Or we can look into implementing such a journal for SegmentNodeStore setups also.

h4. Open Points

* Journal support in SegmentNodeStore
* Handling deletes.

Detailed proposal - https://wiki.apache.org/jackrabbit/Journal%20based%20Async%20Indexer
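The IndexEditor/IndexUpdater decoupling and the two-phase flow described above can be sketched in miniature. The interfaces, the in-memory "journal", and the path-based editor are illustrative assumptions, not Oak's actual editor API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed split: commit-time marking of indexable paths
// (IndexEditor) vs. async updating of the index (IndexUpdater), with a
// list standing in for the journal. All names are illustrative.
public class JournalIndexingSketch {
    /** Commit-time: only decides which paths need indexing. */
    interface IndexEditor {
        boolean shouldIndex(String path);
    }

    /** Async-time: actually updates the index for a marked path. */
    interface IndexUpdater {
        void update(String path);
    }

    // Session commit flow: run the editor over the changed paths and
    // write the indexable ones to the journal.
    static List<String> commit(List<String> changedPaths, IndexEditor editor) {
        List<String> journal = new ArrayList<>();
        for (String p : changedPaths) {
            if (editor.shouldIndex(p)) {
                journal.add(p);
            }
        }
        return journal;
    }

    // AsyncIndexUpdate flow: no diff needed; the paths come from the
    // journal entries written between the two checkpoints.
    static void asyncIndexUpdate(List<String> journal, IndexUpdater updater) {
        for (String p : journal) {
            updater.update(p);
        }
    }
}
```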
[jira] [Created] (OAK-6539) Decrease version export for org.apache.jackrabbit.oak.spi.security.authentication
Alex Deparvu created OAK-6539:
-
Summary: Decrease version export for org.apache.jackrabbit.oak.spi.security.authentication
Key: OAK-6539
URL: https://issues.apache.org/jira/browse/OAK-6539
Project: Jackrabbit Oak
Issue Type: Improvement
Components: core, security
Reporter: Alex Deparvu
Assignee: Alex Deparvu
Priority: Trivial

There's a warning when building oak-core related to the export version for the org.apache.jackrabbit.oak.spi.security.authentication package:
{noformat}
[WARNING] org.apache.jackrabbit.oak.spi.security.authentication: Excessive version increase; detected 1.3.0, suggested 1.2.0
{noformat}
I see no reason to not decrease the version. [~anchela], thoughts?
[jira] [Resolved] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Dulceanu resolved OAK-5902.
--
Resolution: Fixed

Fixed at r1804515. Created OAK-6538 to investigate cold standby memory consumption.

> Cold standby should allow syncing of blobs bigger than 2.2 GB
> -
>
> Key: OAK-5902
> URL: https://issues.apache.org/jira/browse/OAK-5902
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: segment-tar
> Affects Versions: 1.6.1
> Reporter: Andrei Dulceanu
> Assignee: Andrei Dulceanu
> Priority: Minor
> Fix For: 1.8, 1.7.6
>
> Currently there is a limitation on the maximum binary size (in bytes) that can be synced between primary and standby instances. This matches {{Integer.MAX_VALUE}} (2,147,483,647) bytes, and no binaries bigger than this limit can be synced between the instances.
> Per the comment at [1], the current protocol needs to be changed to allow sending binaries in chunks, to surpass this limitation.
> [1] https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125
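The chunking described above can be illustrated with the length arithmetic involved: keeping the blob length as a {{long}} and splitting it into int-sized chunks lifts the {{Integer.MAX_VALUE}} limit. The chunk size and method names here are arbitrary examples, not the standby protocol's actual values.

```java
// Illustrates why chunking removes the Integer.MAX_VALUE limit: the
// blob length stays a long and only individual chunks need int sizes.
// CHUNK_SIZE is an arbitrary example value.
public class BlobChunking {
    static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB per transfer frame

    static long chunkCount(long blobLength) {
        // ceil(blobLength / CHUNK_SIZE) using long arithmetic throughout
        return (blobLength + CHUNK_SIZE - 1) / (long) CHUNK_SIZE;
    }

    static int chunkLength(long blobLength, long chunkIndex) {
        // Last chunk may be shorter; every chunk fits in an int.
        long remaining = blobLength - chunkIndex * (long) CHUNK_SIZE;
        return (int) Math.min(CHUNK_SIZE, remaining);
    }
}
```

With this scheme, syncing a 2.5 GB blob needs memory on the order of one chunk rather than the whole blob.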
[jira] [Resolved] (OAK-6527) CompositeNodeStore permission evaluation fails for open setups
[ https://issues.apache.org/jira/browse/OAK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Deparvu resolved OAK-6527.
---
Resolution: Fixed

Fixed with http://svn.apache.org/viewvc?rev=1804509&view=rev
Following [~anchela]'s feedback I moved the flush method and dropped the AbstractPermissionStore.

> CompositeNodeStore permission evaluation fails for open setups
> --
>
> Key: OAK-6527
> URL: https://issues.apache.org/jira/browse/OAK-6527
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: composite, security
> Affects Versions: 1.7.3, 1.7.4, 1.7.5
> Reporter: Alex Deparvu
> Assignee: Alex Deparvu
> Fix For: 1.7.6
>
> It seems the current setup of OR-ing the composite nodestore permission setups breaks down when the root node has an allow-all on reads. This seems a fundamental flaw in the way it works now, so I'm considering going back to the drawing board and working on the solution proposed by [~chetanm] as part of OAK-3777, effectively making OAK-6356 and OAK-6469 obsolete.
[jira] [Created] (OAK-6538) Investigate cold standby memory consumption
Andrei Dulceanu created OAK-6538:

Summary: Investigate cold standby memory consumption
Key: OAK-6538
URL: https://issues.apache.org/jira/browse/OAK-6538
Project: Jackrabbit Oak
Issue Type: Task
Components: segment-tar
Affects Versions: 1.6.1
Reporter: Andrei Dulceanu
Assignee: Andrei Dulceanu
Priority: Minor
Fix For: 1.8, 1.7.6

In an investigation from some time ago, 4GB of heap were needed for transferring a 1GB blob and 6GB for a 2GB blob. This was in part due to using {{addTestContent}} [0] in the investigation, which allocates a huge {{byte[]}} on the heap.
OAK-5902 introduced chunking for transferring blobs between primary and standby. This way, the memory needed for syncing a big blob should be around the chunk size used. Once the way test data is created is fixed, it should be possible to transfer a big blob (e.g. 2.5 GB) with less memory.
[0] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/standby/DataStoreTestBase.java#L96
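The heap-friendly alternative to the huge {{byte[]}} mentioned above can be sketched as a test-data stream that never materializes the full blob. This is a sketch only, not the actual {{DataStoreTestBase}} fix; the class name is made up.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of heap-friendly test data: an InputStream that produces
// 'length' deterministic bytes on the fly instead of allocating a
// byte[length] on the heap. Illustrative, not the actual test code.
public class GeneratedDataStream extends InputStream {
    private final long length;
    private long position;

    public GeneratedDataStream(long length) {
        this.length = length;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1; // end of the generated blob
        }
        // Byte derived from the position; constant memory regardless of
        // how large 'length' is (it can exceed Integer.MAX_VALUE).
        return (int) ((position++ * 31) & 0xFF);
    }
}
```

A test can then pass this stream where blob content is expected, keeping heap use flat even for multi-GB blobs.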
[jira] [Resolved] (OAK-6537) Don't encode the checksums in the TAR index tests
[ https://issues.apache.org/jira/browse/OAK-6537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Mari resolved OAK-6537.
-
Resolution: Fixed

Fixed at r1804504.

> Don't encode the checksums in the TAR index tests
> -
>
> Key: OAK-6537
> URL: https://issues.apache.org/jira/browse/OAK-6537
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: segment-tar
> Reporter: Francesco Mari
> Assignee: Francesco Mari
> Fix For: 1.8, 1.7.6
>
> The tests for the different formats of the TAR indices encode the checksums of the entries. This makes the tests particularly brittle. The checksums should be computed on the fly based on the test data.
[jira] [Comment Edited] (OAK-4638) Mostly async unique index (for UUIDs for example)
[ https://issues.apache.org/jira/browse/OAK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119733#comment-16119733 ] Chetan Mehrotra edited comment on OAK-4638 at 8/9/17 11:19 AM: --- Based on the approach proposed here I have also created OAK-6535 which covers both. Put up an initial proposal at [https://wiki.apache.org/jackrabbit/Synchronous Lucene Property Indexes] was (Author: chetanm): Based on the approach proposed here I have also created OAK-6535 which covers both. Put up an initial proposal at https://wiki.apache.org/jackrabbit/Synchronous Lucene Property Indexes > Mostly async unique index (for UUIDs for example) > - > > Key: OAK-4638 > URL: https://issues.apache.org/jira/browse/OAK-4638 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: property-index, query >Reporter: Thomas Mueller > > The UUID index takes a lot of space. For the UUID index, we should consider > using mainly an async index. This is possible because there are two types of > UUIDs: those generated in Oak, which are sure to be unique (no need to > check), and those set in the application code, for example by importing > packages. For older nodes, an async index is sufficient, and a synchronous > index is only (temporarily) needed for imported nodes. For UUIDs, we could > also change the generation algorithm if needed. > It might be possible to use a similar pattern for regular unique indexes as > well: only keep the added entries of the last 24 hours (for example) in a > property index, and then move entries to an async index which needs less > space. That would slow down adding entries, as two indexes need to be checked. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OAK-4638) Mostly async unique index (for UUIDs for example)
[ https://issues.apache.org/jira/browse/OAK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119733#comment-16119733 ] Chetan Mehrotra commented on OAK-4638: -- Based on the approach proposed here I have also created OAK-6535 which covers both. Put up an initial proposal at https://wiki.apache.org/jackrabbit/Synchronous Lucene Property Indexes > Mostly async unique index (for UUIDs for example) > - > > Key: OAK-4638 > URL: https://issues.apache.org/jira/browse/OAK-4638 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: property-index, query >Reporter: Thomas Mueller > > The UUID index takes a lot of space. For the UUID index, we should consider > using mainly an async index. This is possible because there are two types of > UUIDs: those generated in Oak, which are sure to be unique (no need to > check), and those set in the application code, for example by importing > packages. For older nodes, an async index is sufficient, and a synchronous > index is only (temporarily) needed for imported nodes. For UUIDs, we could > also change the generation algorithm if needed. > It might be possible to use a similar pattern for regular unique indexes as > well: only keep the added entries of the last 24 hours (for example) in a > property index, and then move entries to an async index which needs less > space. That would slow down adding entries, as two indexes need to be checked. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OAK-6537) Don't encode the checksums in the TAR index tests
Francesco Mari created OAK-6537: --- Summary: Don't encode the checksums in the TAR index tests Key: OAK-6537 URL: https://issues.apache.org/jira/browse/OAK-6537 Project: Jackrabbit Oak Issue Type: Improvement Components: segment-tar Reporter: Francesco Mari Assignee: Francesco Mari Fix For: 1.8, 1.7.6 The tests for the different formats of the TAR indices encode the checksums of the entries. This makes the tests particularly brittle. The checksums should be computed on the fly based on the test data. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OAK-6529) IndexLoaderV1 and IndexLoaderV2 should not rely on Buffer.array()
[ https://issues.apache.org/jira/browse/OAK-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francesco Mari resolved OAK-6529. - Resolution: Fixed Fixed at r1804503. > IndexLoaderV1 and IndexLoaderV2 should not rely on Buffer.array() > - > > Key: OAK-6529 > URL: https://issues.apache.org/jira/browse/OAK-6529 > Project: Jackrabbit Oak > Issue Type: Bug > Components: segment-tar >Reporter: Francesco Mari >Assignee: Francesco Mari > Fix For: 1.8, 1.7.6 > > > The code in {{IndexLoaderV1}} and {{IndexLoaderV2}} calls {{Buffer.array()}} > to compute the checksum. This method might fail with an > {{UnsupportedOperationException}} if the {{Buffer}} points to a memory mapped > region. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
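The failure mode and a defensive alternative can be illustrated with plain {{java.nio}} (Oak's {{Buffer}} is its own abstraction, so this is a sketch of the underlying issue rather than the actual fix): {{ByteBuffer.array()}} only works when {{hasArray()}} is true; a direct or memory-mapped buffer has no accessible backing array, so the bytes must be copied out first.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class SafeChecksum {
    // array() throws UnsupportedOperationException on direct (e.g. memory-mapped)
    // buffers, so fall back to copying the bytes out through a duplicate.
    public static long checksum(ByteBuffer buffer) {
        CRC32 crc = new CRC32();
        if (buffer.hasArray()) {
            crc.update(buffer.array(), buffer.arrayOffset() + buffer.position(), buffer.remaining());
        } else {
            ByteBuffer dup = buffer.duplicate(); // don't disturb the caller's position
            byte[] copy = new byte[dup.remaining()];
            dup.get(copy);
            crc.update(copy, 0, copy.length);
        }
        return crc.getValue();
    }
}
```

Both branches produce the same checksum, so callers don't need to care whether the buffer is heap-backed or mapped.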
[jira] [Commented] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119687#comment-16119687 ] Francesco Mari commented on OAK-5902: - [~dulceanu] makes sense. Go ahead and commit this, we will tackle the rest later. > Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
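The chunked transfer discussed above reduces to a bit of long arithmetic: once offsets are {{long}}, blob sizes are no longer capped by {{Integer.MAX_VALUE}}. A sketch of the chunk math (illustrative, not the actual {{StandbyClient}} protocol code; 1 MB is the default chunk size mentioned in the comments):

```java
public class ChunkMath {
    public static final int DEFAULT_CHUNK_SIZE = 1024 * 1024; // 1 MB

    // Number of chunks needed for a blob of `length` bytes; long arithmetic
    // avoids the old Integer.MAX_VALUE ceiling.
    public static long chunkCount(long length, int chunkSize) {
        return (length + chunkSize - 1) / chunkSize;
    }

    // Size of chunk `index`; only the last chunk may be shorter than chunkSize.
    public static int chunkSizeAt(long length, int chunkSize, long index) {
        long offset = index * (long) chunkSize;
        return (int) Math.min(chunkSize, length - offset);
    }
}
```

With this scheme the standby only ever holds one chunk in memory, which is why the heap needed for syncing should track the chunk size rather than the blob size.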
[jira] [Resolved] (OAK-3710) Continuous revision GC
[ https://issues.apache.org/jira/browse/OAK-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Reutegger resolved OAK-3710. --- Resolution: Fixed Assignee: Marcel Reutegger Fix Version/s: 1.7.6 1.8 This feature is now implemented but disabled by default. See also documentation on OSGi configuration for the DocumentNodeStore (versionGCContinuous): https://jackrabbit.apache.org/oak/docs/osgi_config.html#DocumentNodeStore > Continuous revision GC > -- > > Key: OAK-3710 > URL: https://issues.apache.org/jira/browse/OAK-3710 > Project: Jackrabbit Oak > Issue Type: New Feature > Components: documentmk >Reporter: Marcel Reutegger >Assignee: Marcel Reutegger > Fix For: 1.8, 1.7.6 > > > Implement continuous revision GC cleaning up documents older than a given > threshold (e.g. one day). This issue is related to OAK-3070 where each GC run > starts where the last one finished. > This will avoid peak load on the system as we see it right now, when GC is > triggered once a day. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (OAK-6536) Periodic log message from continuous RGC
[ https://issues.apache.org/jira/browse/OAK-6536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcel Reutegger resolved OAK-6536. --- Resolution: Fixed Fix Version/s: 1.7.6 The continuous revision GC job now logs an info message every hour. Implemented in trunk: http://svn.apache.org/r1804500 > Periodic log message from continuous RGC > > > Key: OAK-6536 > URL: https://issues.apache.org/jira/browse/OAK-6536 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: core, documentmk >Reporter: Marcel Reutegger >Assignee: Marcel Reutegger >Priority: Minor > Fix For: 1.8, 1.7.6 > > > The continuous revision garbage collection should issue periodic info log > messages with statistics. The format should be similar to the log message > issued by the regular revision garbage collector. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
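Throttling a long-running job's stats output to one info message per hour boils down to a simple time check. A minimal sketch with hypothetical names (the actual DocumentNodeStore implementation differs):

```java
import java.util.concurrent.TimeUnit;

// Decides when a periodic stats line is due; the GC loop calls isDue(now)
// on every iteration and logs only when it returns true.
public class PeriodicStatsLogger {
    private final long intervalMillis;
    private long lastLogTime;

    public PeriodicStatsLogger(long interval, TimeUnit unit, long startMillis) {
        this.intervalMillis = unit.toMillis(interval);
        this.lastLogTime = startMillis;
    }

    public boolean isDue(long nowMillis) {
        if (nowMillis - lastLogTime >= intervalMillis) {
            lastLogTime = nowMillis; // arm the next window
            return true;
        }
        return false;
    }
}
```

The accumulated statistics are then formatted like the regular revision GC log line whenever {{isDue}} fires.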
[jira] [Commented] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119638#comment-16119638 ] Michael Dürig commented on OAK-5902: bq. I suggest to create a separate issue for analysing the memory consumption and to commit all the changes +1 > Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (OAK-6504) Active deletion of blobs needs to indicate information about purged blobs to mark-sweep collector
[ https://issues.apache.org/jira/browse/OAK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Jain updated OAK-6504: --- Attachment: OAK_6504.patch Attached patch. [~catholicon], [~chetanm] Please review. I also restructured the existing ActiveDeletedBlobCollectionIT to extract an abstract class and added a new test class. > Active deletion of blobs needs to indicate information about purged blobs to > mark-sweep collector > - > > Key: OAK-6504 > URL: https://issues.apache.org/jira/browse/OAK-6504 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene >Affects Versions: 1.7.1 >Reporter: Vikas Saurabh >Assignee: Amit Jain >Priority: Minor > Fix For: 1.8, 1.7.6 > > Attachments: OAK_6504.patch > > > Mark sweep blob collector (since 1.6) tracks blobs in store. Active purge of > lucene index blobs doesn't update these tracked blobs which leads to mark > sweep collector to attempt to delete those blobs again. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (OAK-6504) Active deletion of blobs needs to indicate information about purged blobs to mark-sweep collector
[ https://issues.apache.org/jira/browse/OAK-6504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Jain updated OAK-6504: --- Fix Version/s: 1.7.6 > Active deletion of blobs needs to indicate information about purged blobs to > mark-sweep collector > - > > Key: OAK-6504 > URL: https://issues.apache.org/jira/browse/OAK-6504 > Project: Jackrabbit Oak > Issue Type: Bug > Components: lucene >Affects Versions: 1.7.1 >Reporter: Vikas Saurabh >Assignee: Amit Jain >Priority: Minor > Fix For: 1.8, 1.7.6 > > Attachments: OAK_6504.patch > > > Mark sweep blob collector (since 1.6) tracks blobs in store. Active purge of > lucene index blobs doesn't update these tracked blobs which leads to mark > sweep collector to attempt to delete those blobs again. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119568#comment-16119568 ] Andrei Dulceanu commented on OAK-5902: -- [~frm], [~mduerig] bq. Before committing, the problems with the memory consumption in {{DataStoreTestBase.testSyncBigBlob}} I think the memory consumption is the key here. In an investigation from some time ago, 4GB of heap were needed for a 1GB blob and 6GB for a 2GB blob. This was in part due to using {{addTestContent}} in the investigation, which allocates that huge {{byte[]}} on the heap. With the new approach in {{addTestContentOnTheFly}} this problem is solved, and the chunking per se improved things a lot. We are now in the position of successfully syncing a 2.5 GB blob with only 3.5 GB memory. bq. the running time in ExternalPrivateStoreIT should be investigated. My analysis shows that {{51s}} are spent adding the test content (i.e. the 2.5 GB blob), {{61s}} are spent syncing between master and standby and another {{44s}} are spent checking that the sync was ok (i.e. comparing two streams summing up to 2.5 GB). I find nothing unusual here. bq. Agreed, increasing the heap for the tests is problematic and we shouldn't do this. At least we need to understand where the memory requirements come from: is it the test or the code? Agree. I suggest creating a separate issue for analysing the memory consumption and committing all the changes, except: * the heap size increase in {{pom.xml}} * annotating {{testSyncBigBlob}} with {{@Ignore(OAK-XXX)}} Since all our ITs for cold standby use chunking now (the default {{1MB}} chunk size) and they all pass, I'd say we can safely commit the rest of the changes, as explained above. WDYT?
> Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (OAK-6536) Periodic log message from continuous RGC
Marcel Reutegger created OAK-6536: - Summary: Periodic log message from continuous RGC Key: OAK-6536 URL: https://issues.apache.org/jira/browse/OAK-6536 Project: Jackrabbit Oak Issue Type: Improvement Components: core, documentmk Reporter: Marcel Reutegger Assignee: Marcel Reutegger Priority: Minor Fix For: 1.8 The continuous revision garbage collection should issue periodic info log messages with statistics. The format should be similar to the log message issued by the regular revision garbage collector. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119542#comment-16119542 ] Michael Dürig commented on OAK-5902: bq. memory consumption Agreed, increasing the heap for the tests is problematic and we shouldn't do this. At least we need to understand where the memory requirements come from: is it the test or the code? > Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119535#comment-16119535 ] Francesco Mari commented on OAK-5902: - [~dulceanu], I had a look at the code. Your solution looks very good. Before committing, the problems with the memory consumption in `DataStoreTestBase.testSyncBigBlob` and the running time in `ExternalPrivateStoreIT` should be investigated. It would be good to at least frame the problem, so we can plan for further improvements on this patch. > Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (OAK-5902) Cold standby should allow syncing of blobs bigger than 2.2 GB
[ https://issues.apache.org/jira/browse/OAK-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119535#comment-16119535 ] Francesco Mari edited comment on OAK-5902 at 8/9/17 7:47 AM: - [~dulceanu], I had a look at the code. Your solution looks very good. Before committing, the problems with the memory consumption in {{DataStoreTestBase.testSyncBigBlob}} and the running time in {{ExternalPrivateStoreIT}} should be investigated. It would be good to at least frame the problem, so we can plan for further improvements on this patch. was (Author: frm): [~dulceanu], I had a look at the code. Your solution looks very good. Before committing, the problems with the memory consumption in `DataStoreTestBase.testSyncBigBlob` and the running time in `ExternalPrivateStoreIT` should be investigated. It would be good to at least frame the problem, so we can plan for further improvements on this patch. > Cold standby should allow syncing of blobs bigger than 2.2 GB > - > > Key: OAK-5902 > URL: https://issues.apache.org/jira/browse/OAK-5902 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: segment-tar >Affects Versions: 1.6.1 >Reporter: Andrei Dulceanu >Assignee: Andrei Dulceanu >Priority: Minor > Fix For: 1.8, 1.7.6 > > > Currently there is a limitation for the maximum binary size (in bytes) to be > synced between primary and standby instances. This matches > {{Integer.MAX_VALUE}} (2,147,483,647) bytes and no binaries bigger than this > limit can be synced between the instances. > Per comment at [1], the current protocol needs to be changed to allow sending > of binaries in chunks, to surpass this limitation. > [1] > https://github.com/apache/jackrabbit-oak/blob/1.6/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/StandbyClient.java#L125 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (OAK-6534) Compute indexPaths from index definitions json
[ https://issues.apache.org/jira/browse/OAK-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119499#comment-16119499 ] Paul Chibulcuteanu edited comment on OAK-6534 at 8/9/17 6:54 AM: - [~chetanm], yes this would be fine. This way, if one wants to reindex everything present in the --index-definitions-file then --index-paths should not be provided. was (Author: chibulcu): [~chetanm], yes this would be fine. This way, if one wants to reindex everything present in the _--index-definitions-file_ then _--index-paths_ should not be provided. > Compute indexPaths from index definitions json > -- > > Key: OAK-6534 > URL: https://issues.apache.org/jira/browse/OAK-6534 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: run >Reporter: Chetan Mehrotra >Assignee: Chetan Mehrotra >Priority: Minor > Fix For: 1.8 > > > Currently while adding/updating indexes via {{--index-definitions-file}} > (OAK-6471) the index paths are always determined by the {{--index-paths}} option. > If there are more index definitions present in the json file then those would > be ignored. > To avoid confusion the following approach should be implemented > * If {{--index-paths}} is specified then use that > * If not and {{--index-definitions-file}} is provided then compute index > paths from that > * If both are specified then {{--index-paths}} takes precedence (no merging > done) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (OAK-6534) Compute indexPaths from index definitions json
[ https://issues.apache.org/jira/browse/OAK-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119499#comment-16119499 ] Paul Chibulcuteanu commented on OAK-6534: - [~chetanm], yes this would be fine. This way, if one wants to reindex everything present in the _--index-definitions-file_ then _--index-paths_ should not be provided. > Compute indexPaths from index definitions json > -- > > Key: OAK-6534 > URL: https://issues.apache.org/jira/browse/OAK-6534 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: run >Reporter: Chetan Mehrotra >Assignee: Chetan Mehrotra >Priority: Minor > Fix For: 1.8 > > > Currently while adding/updating indexes via {{--index-definitions-file}} > (OAK-6471) the index paths are always determined by the {{--index-paths}} option. > If there are more index definitions present in the json file then those would > be ignored. > To avoid confusion the following approach should be implemented > * If {{--index-paths}} is specified then use that > * If not and {{--index-definitions-file}} is provided then compute index > paths from that > * If both are specified then {{--index-paths}} takes precedence (no merging > done) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
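The resolution rule discussed above (use --index-paths when given, otherwise derive the paths from the definitions file) can be sketched as follows; the helper is hypothetical, not oak-run's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IndexPathResolver {
    // --index-paths takes precedence; otherwise index paths are the top-level
    // keys of the parsed index definitions JSON (e.g. "/oak:index/fooIndex").
    public static List<String> resolve(List<String> cliPaths, Map<String, Object> definitions) {
        if (cliPaths != null && !cliPaths.isEmpty()) {
            return cliPaths;
        }
        return new ArrayList<>(definitions.keySet());
    }
}
```

So, exactly as the comment says: to reindex everything present in the definitions file, simply omit --index-paths.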
[jira] [Created] (OAK-6535) Synchronous Lucene Property Indexes
Chetan Mehrotra created OAK-6535: Summary: Synchronous Lucene Property Indexes Key: OAK-6535 URL: https://issues.apache.org/jira/browse/OAK-6535 Project: Jackrabbit Oak Issue Type: New Feature Components: lucene, property-index Reporter: Chetan Mehrotra Assignee: Chetan Mehrotra Fix For: 1.8 Oak 1.6 added support for Lucene Hybrid Index (OAK-4412). That enables near real time (NRT) support for Lucene based indexes. It also had limited support for sync indexes. This feature aims to take that to the next level and enable support for sync property indexes. More details at https://wiki.apache.org/jackrabbit/Synchronous%20Lucene%20Property%20Indexes -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (OAK-6269) Support non chunk storage in OakDirectory
[ https://issues.apache.org/jira/browse/OAK-6269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chetan Mehrotra reassigned OAK-6269: Assignee: Vikas Saurabh > Support non chunk storage in OakDirectory > - > > Key: OAK-6269 > URL: https://issues.apache.org/jira/browse/OAK-6269 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Chetan Mehrotra >Assignee: Vikas Saurabh > Fix For: 1.8 > > > Logging this issue based on offline discussion with [~catholicon]. > Currently OakDirectory stores files in chunks of 1 MB each. So a 1 GB file > would be stored in 1000+ chunks of 1 MB. > This design was done to support direct usage of OakDirectory with Lucene as > Lucene makes use of random IO. Chunked storage allows it to seek to a random > position quickly. If the files are stored as Blobs then it's only possible to > access them via streaming, which would be slow. > As most setups now use copy-on-read and copy-on-write support and rely on a > local copy of the index, we can have an implementation which stores the file as a > single blob. > *Pros* > * Quite a bit of reduction in the number of small blobs stored in the BlobStore. > This should reduce the GC time, especially for S3 > * Reduced overhead of storing a single file in the repository. Instead of an array > of 1k blobids we would store a single blobid > * Potential improvement in IO cost as the file can be read in one connection and > uploaded in one. > *Cons* > It would not be possible to use OakDirectory directly (or it would be very slow) > and we would always need to do a local copy. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (OAK-6513) Journal based Async Indexer
[ https://issues.apache.org/jira/browse/OAK-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chetan Mehrotra updated OAK-6513: - Description: Current async indexer design is based on NodeState diff. This has served us fine so far however of late it is not able to perform well if the rate of repository writes is high. When changes happen faster than index-update can process them, larger and larger diffs will happen. These make index-updates slower, which again leads to the next diff being ever larger than the one before (assuming a constant ingestion rate). In the current diff based flow the indexer performs a complete diff for all changes happening between 2 cycles. It may happen that lots of writes happen but not much indexable content is written. So doing a diff there is wasted effort. In the 1.6 release for NRT Indexing we implemented journal based indexing for external changes (OAK-4808, OAK-5430). That approach can be generalized and used for async indexing. Before talking about the journal based approach let's see how IndexEditor works currently h4. IndexEditor Currently any IndexEditor performs 2 tasks # Identify which node is to be indexed based on some index definition. The Editor gets invoked as part of content diff where it determines which NodeState is to be indexed # Update the index based on the node to be indexed E.g. in oak-lucene we have LuceneIndexEditor which identifies the NodeStates to be indexed and LuceneDocumentMaker which constructs the Lucene Document from the NodeState to be indexed. For the journal based approach we can decouple these 2 parts and thus have * IndexEditor - Identifies which paths need to be indexed for a given index definition * IndexUpdater - Updates the index based on a given NodeState and its path h4. High Level Flow # Session Commit Flow ## Each index type would provide an IndexEditor which would be invoked as part of commit (like sync indexes). These IndexEditors would just determine which paths need to be indexed. 
## As part of commit the paths to be indexed would be written to the journal. # AsyncIndexUpdate flow ## AsyncIndexUpdate would query this journal to fetch all such indexed paths between the 2 checkpoints ## Based on the index path data it would invoke the {{IndexUpdater}} to update the index for that path ## Merge the index updates h4. Benefits Such a design would have the following impact # More work done as part of write # Marking of indexable content is distributed hence less work needs to be done at indexing time # Indexing can progress in batches # The indexers can be called in parallel h4. Journal Implementation DocumentNodeStore currently has an in-built journal which is being used for NRT Indexing. That feature can be exposed as an API. For scaling index this design is mostly required for the cluster case. So we can possibly have both indexing support implemented and use the journal based support for DocumentNodeStore setups. Or we can look into implementing such a journal for SegmentNodeStore setups also h4. Open Points * Journal support in SegmentNodeStore * Handling deletes. was: Current async indexer design is based on NodeState diff. This has served us fine so far however of late it is not able to perform well if the rate of repository writes is high. When changes happen faster than index-update can process them, larger and larger diffs will happen. These make index-updates slower, which again leads to the next diff being ever larger than the one before (assuming a constant ingestion rate). In the current diff based flow the indexer performs a complete diff for all changes happening between 2 cycles. It may happen that lots of writes happen but not much indexable content is written. So doing a diff there is wasted effort. In the 1.6 release for NRT Indexing we implemented journal based indexing for external changes (OAK-4808, OAK-5430). That approach can be generalized and used for async indexing. 
Before talking about the journal based approach let's see how IndexEditor works currently h4. IndexEditor Currently any IndexEditor performs 2 tasks # Identify which node is to be indexed based on some index definition. The Editor gets invoked as part of content diff where it determines which NodeState is to be indexed # Update the index based on the node to be indexed E.g. in oak-lucene we have LuceneIndexEditor which identifies the NodeStates to be indexed and LuceneDocumentMaker which constructs the Lucene Document from the NodeState to be indexed. For the journal based approach we can decouple these 2 parts and thus have * IndexEditor - Identifies which paths need to be indexed for a given index definition * IndexUpdater - Updates the index based on a given NodeState and its path h4. High Level Flow # Session Commit Flow ## Each index type would provide an IndexEditor which would be invoked as part of commit (like sync indexes). These