[jira] [Resolved] (OAK-10804) Indexing job: optimize check for hidden nodes
[ https://issues.apache.org/jira/browse/OAK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10804. --- Fix Version/s: 1.64.0 Resolution: Done > Indexing job: optimize check for hidden nodes > - > > Key: OAK-10804 > URL: https://issues.apache.org/jira/browse/OAK-10804 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.64.0 > > > While downloading the repository from Mongo, the indexing job has to discard > hidden entries. This is being done by a call to > {{NodeStateUtils.isHiddenPath()}}. This call is rather expensive, as it > creates an iterator over the path segments, which requires creating a new > string for each path segment. As the indexing job has to check every entry to > verify if it is hidden, this creates a significant overhead. > The implementation of checking for hidden paths can be replaced by a simple > search for {{"/:"}} in the string representing the path, which requires no > object allocation and should therefore be much faster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
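The two approaches described in OAK-10804 can be compared in a minimal sketch. The class and method names below are hypothetical and this is not Oak's actual code; it also assumes that paths are absolute, so that every segment is preceded by a slash.

```java
final class HiddenPathCheck {
    private HiddenPathCheck() {}

    /**
     * Allocation-free check (hypothetical helper): a path is treated as hidden
     * if any of its segments starts with ':'. Assumes an absolute path, so each
     * segment is preceded by '/'.
     */
    static boolean isHidden(String path) {
        return path.contains("/:");
    }

    /**
     * Segment-by-segment version for comparison. It creates one new string per
     * path segment, which is the overhead the issue wants to avoid.
     */
    static boolean isHiddenBySegments(String path) {
        for (String segment : path.split("/")) {
            if (segment.startsWith(":")) {
                return true;
            }
        }
        return false;
    }
}
```

Under the absolute-path assumption both methods give the same answer, but only the first avoids creating intermediate strings.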
[jira] [Resolved] (OAK-10810) Remove redundant call to StringCache.get in Path.fromString()
[ https://issues.apache.org/jira/browse/OAK-10810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10810. --- Fix Version/s: 1.64.0 Resolution: Done > Remove redundant call to StringCache.get in Path.fromString() > - > > Key: OAK-10810 > URL: https://issues.apache.org/jira/browse/OAK-10810 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: candidate_oak_1_22 > Fix For: 1.64.0
[jira] [Created] (OAK-10810) Remove redundant call to StringCache.get in Path.fromString()
Nuno Santos created OAK-10810: - Summary: Remove redundant call to StringCache.get in Path.fromString() Key: OAK-10810 URL: https://issues.apache.org/jira/browse/OAK-10810 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos
[jira] [Commented] (OAK-10778) Indexing job: support parallel download from MongoDB with two connections in Pipelined strategy
[ https://issues.apache.org/jira/browse/OAK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846831#comment-17846831 ] Nuno Santos commented on OAK-10778: --- Fix here: [https://github.com/apache/jackrabbit-oak/pull/1463] > Indexing job: support parallel download from MongoDB with two connections in > Pipelined strategy > --- > > Key: OAK-10778 > URL: https://issues.apache.org/jira/browse/OAK-10778 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Assignee: Nuno Santos >Priority: Major > Fix For: 1.64.0 > > > The current version of the Pipelined download strategy uses a single > connection/thread to download from MongoDB. We can further increase the > download speed by using an additional MongoDB connection. A Mongo deployment > has 1 primary and 2 secondaries, so in principle we could have 1 connection > to each secondary, effectively doubling the download speed. > There are a few points to observe: > - Connections should go to different secondaries. If both connections go to > the same secondary, there's a high chance that they will be limited by what a > single replica can provide and that they will overload that replica. So each > secondary should have one and only one connection. > - How to partition the range of documents to download between two threads. > We are already downloading from Mongo in order of {{(_modified, _id)}}. A > simple and effective partition strategy for 2 connections is for one to > download in ascending and the other in descending order.
[jira] [Created] (OAK-10808) PipelinedMongoConnectionFailureIT should not fail if Mongo is not available
Nuno Santos created OAK-10808: - Summary: PipelinedMongoConnectionFailureIT should not fail if Mongo is not available Key: OAK-10808 URL: https://issues.apache.org/jira/browse/OAK-10808 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos
[jira] [Resolved] (OAK-10788) Indexing job downloader: shutdown gracefully all threads in case of failure
[ https://issues.apache.org/jira/browse/OAK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10788. --- Fix Version/s: 1.64.0 Resolution: Done > Indexing job downloader: shutdown gracefully all threads in case of failure > --- > > Key: OAK-10788 > URL: https://issues.apache.org/jira/browse/OAK-10788 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.64.0 > > > If the download fails, not all of the threads created by the Pipelined > strategy are shut down correctly; some of them may be left behind. As they are > all daemon threads, they will not prevent the JVM from shutting down. But > when they are forcibly closed at JVM shutdown, they log several exceptions > (connections closed abruptly, attempts to access objects that were already > closed) that are confusing and distract from the root cause of the problem.
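One common way to avoid leftover worker threads after a failure is to stop the whole pool explicitly and wait for it to terminate. The following is a minimal sketch of that idea using the standard {{ExecutorService}} API. The helper name {{stopAll}} is hypothetical, and the sketch assumes the workers run in an executor and react to interruption; it is not taken from the actual Oak fix.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class GracefulShutdown {
    private GracefulShutdown() {}

    /**
     * Hypothetical helper: interrupts all workers of the pool and waits until
     * they have finished, so that none is left behind to be killed at JVM exit.
     * Returns true if every worker terminated within the timeout.
     */
    static boolean stopAll(ExecutorService pool, long timeoutMillis) {
        // Cancels queued tasks and interrupts the running ones.
        pool.shutdownNow();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller would typically invoke this from the error path of the download, before reporting the original exception, so that the root cause is the last thing in the log.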
[jira] [Updated] (OAK-10804) Indexing job: optimize check for hidden nodes
[ https://issues.apache.org/jira/browse/OAK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10804: -- Description: While downloading the repository from Mongo, the indexing job has to discard hidden entries. This is being done by a call to {{NodeStateUtils.isHiddenPath()}}. This call is rather expensive, as it creates an iterator over the path segments, which requires creating a new string for each path segment. As the indexing job has to check every entry to verify if it is hidden, this creates a significant overhead. The implementation of checking for hidden paths can be replaced by a simple search for {{"/:"}} in the string representing the path, which requires no object allocation and should therefore be much faster. was: While downloading the repository from Mongo, the indexing job has to discard hidden entries. This is being done by a call to `NodeStateUtils.isHiddenPath()`. This call is rather expensive, as it creates an iterator over the path segments, which requires creating a new string for each path segment. As the indexing job has to check every entry to verify if it is hidden, this creates a significant overhead. The implementation of checking for hidden paths can be replaced by a simple search for {{"/:"}} in the string representing the path, which requires no object allocation and should therefore be much faster. > Indexing job: optimize check for hidden nodes > - > > Key: OAK-10804 > URL: https://issues.apache.org/jira/browse/OAK-10804 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > While downloading the repository from Mongo, the indexing job has to discard > hidden entries. This is being done by a call to > {{NodeStateUtils.isHiddenPath()}}. This call is rather expensive, as it > creates an iterator over the path segments, which requires creating a new > string for each path segment. As the indexing job has to check every entry to > verify if it is hidden, this creates a significant overhead. > The implementation of checking for hidden paths can be replaced by a simple > search for {{"/:"}} in the string representing the path, which requires no > object allocation and should therefore be much faster.
[jira] [Updated] (OAK-10804) Indexing job: optimize check for hidden nodes
[ https://issues.apache.org/jira/browse/OAK-10804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10804: -- Summary: Indexing job: optimize check for hidden nodes (was: Indexing job: optimize check for if a node is hidden) > Indexing job: optimize check for hidden nodes > - > > Key: OAK-10804 > URL: https://issues.apache.org/jira/browse/OAK-10804 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > While downloading the repository from Mongo, the indexing job has to discard > hidden entries. This is being done by a call to > `NodeStateUtils.isHiddenPath()`. This call is rather expensive, as it creates > an iterator over the path segments, which requires creating a new string for > each path segment. As the indexing job has to check every entry to verify if > it is hidden, this creates a significant overhead. > The implementation of checking for hidden paths can be replaced by a simple > search for {{"/:"}} in the string representing the path, which requires no > object allocation and should therefore be much faster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10804) Indexing job: optimize check for if a node is hidden
Nuno Santos created OAK-10804: - Summary: Indexing job: optimize check for if a node is hidden Key: OAK-10804 URL: https://issues.apache.org/jira/browse/OAK-10804 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos While downloading the repository from Mongo, the indexing job has to discard hidden entries. This is being done by a call to `NodeStateUtils.isHiddenPath()`. This call is rather expensive, as it creates an iterator over the path segments, which requires creating a new string for each path segment. As the indexing job has to check every entry to verify if it is hidden, this creates a significant overhead. The implementation of checking for hidden paths can be replaced by a simple search for {{"/:"}} in the string representing the path, which requires no object allocation and should therefore be much faster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10796) Avoid creation of intermediate StringBuilder in JsopBuilder
[ https://issues.apache.org/jira/browse/OAK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10796. --- Fix Version/s: 1.64.0 Resolution: Done > Avoid creation of intermediate StringBuilder in JsopBuilder > --- > > Key: OAK-10796 > URL: https://issues.apache.org/jira/browse/OAK-10796 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: commons >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.64.0 > > > Since invokedynamic was added to Java, this code in the JsopBuilder class: > {code:java} > StringBuilder buff = new StringBuilder(length + 2); > return buff.append('\"').append(s).append('\"').toString(); {code} > can be written more simply and more efficiently like this: > {code:java} > return '\"' + s + '\"'; > {code} > https://www.baeldung.com/java-string-concatenation-invoke-dynamic -- This message was sent by Atlassian Jira (v8.20.10#820010)
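The two forms quoted in OAK-10796 can be placed side by side in a small sketch. The class and method names are hypothetical and the code is not copied from {{JsopBuilder}}; it only shows that both forms produce the same string. Whether the second form is faster depends on the JDK: since Java 9, javac compiles string concatenation to an invokedynamic call, which is the effect the issue relies on.

```java
final class QuoteExample {
    private QuoteExample() {}

    /** Explicit builder, sized for the input plus the two quote characters. */
    static String quoteWithBuilder(String s) {
        StringBuilder buff = new StringBuilder(s.length() + 2);
        return buff.append('"').append(s).append('"').toString();
    }

    /** Plain concatenation; char + String yields a String, not an int. */
    static String quoteWithConcat(String s) {
        return '"' + s + '"';
    }
}
```

Note that neither method escapes quote characters inside {{s}}; the sketch covers only the wrapping step shown in the issue.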
[jira] [Resolved] (OAK-10795) Indexing job: eliminate unnecessary intermediate object creation in transform stage
[ https://issues.apache.org/jira/browse/OAK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10795. --- Fix Version/s: 1.64.0 Resolution: Done > Indexing job: eliminate unnecessary intermediate object creation in transform > stage > --- > > Key: OAK-10795 > URL: https://issues.apache.org/jira/browse/OAK-10795 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.64.0
[jira] [Resolved] (OAK-10778) Indexing job: support parallel download from MongoDB with two connections in Pipelined strategy
[ https://issues.apache.org/jira/browse/OAK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10778. --- Fix Version/s: 1.64.0 Resolution: Done > Indexing job: support parallel download from MongoDB with two connections in > Pipelined strategy > --- > > Key: OAK-10778 > URL: https://issues.apache.org/jira/browse/OAK-10778 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.64.0 > > > The current version of the Pipelined download strategy uses a single > connection/thread to download from MongoDB. We can further increase the > download speed by using an additional MongoDB connection. A Mongo deployment > has 1 primary and 2 secondaries, so in principle we could have 1 connection > to each secondary, effectively doubling the download speed. > There are a few points to observe: > - Connections should go to different secondaries. If both connections go to > the same secondary, there's a high chance that they will be limited by what a > single replica can provide and that they will overload that replica. So each > secondary should have one and only one connection. > - How to partition the range of documents to download between two threads. > We are already downloading from Mongo in order of {{(_modified, _id)}}. A > simple and effective partition strategy for 2 connections is for one to > download in ascending and the other in descending order.
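The ascending/descending partition can be illustrated with an in-memory simulation. This is a sketch under stated assumptions, not the Oak implementation: the sorted list stands in for the Mongo collection ordered by {{(_modified, _id)}}, the shared {{low}}/{{high}} indexes stand in for whatever mechanism the real downloader uses to detect that the two traversals have met, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TwoWayDownload {
    private final List<String> sorted; // stand-in for the ordered collection
    private int low;                   // next position for the ascending reader
    private int high;                  // next position for the descending reader

    private TwoWayDownload(List<String> sortedDocs) {
        this.sorted = sortedDocs;
        this.low = 0;
        this.high = sortedDocs.size() - 1;
    }

    // Each reader claims one document at a time; both stop once the ranges cross.
    private synchronized String nextAscending() {
        return low <= high ? sorted.get(low++) : null;
    }

    private synchronized String nextDescending() {
        return low <= high ? sorted.get(high--) : null;
    }

    /** Reads every document exactly once using two threads walking towards each other. */
    static List<String> downloadAll(List<String> sortedDocs) {
        TwoWayDownload state = new TwoWayDownload(sortedDocs);
        List<String> received = Collections.synchronizedList(new ArrayList<>());
        Thread ascending = new Thread(() -> {
            String doc;
            while ((doc = state.nextAscending()) != null) {
                received.add(doc);
            }
        });
        Thread descending = new Thread(() -> {
            String doc;
            while ((doc = state.nextDescending()) != null) {
                received.add(doc);
            }
        });
        ascending.start();
        descending.start();
        try {
            ascending.join();
            descending.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while downloading", e);
        }
        return new ArrayList<>(received);
    }
}
```

The appeal of this partition is that no split point has to be chosen in advance: whichever reader is faster simply covers more of the range.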
[jira] [Updated] (OAK-10796) Avoid creation of intermediate StringBuilder in JsopBuilder
[ https://issues.apache.org/jira/browse/OAK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10796: -- Summary: Avoid creation of intermediate StringBuilder in JsopBuilder (was: Avoid creation of intermedia Stringbuffer in JsopBuilder) > Avoid creation of intermediate StringBuilder in JsopBuilder > --- > > Key: OAK-10796 > URL: https://issues.apache.org/jira/browse/OAK-10796 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > Since invokedynamic was added to Java, this code in the JsopBuilder class: > {code:java} > StringBuilder buff = new StringBuilder(length + 2); > return buff.append('\"').append(s).append('\"').toString(); {code} > can be written more simply and more efficiently like this: > {code:java} > return '\"' + s + '\"'; > {code} > https://www.baeldung.com/java-string-concatenation-invoke-dynamic -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10796) Avoid creation of intermedia Stringbuffer in JsopBuilder
[ https://issues.apache.org/jira/browse/OAK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10796: -- Description: Since invokedynamic was added to Java, this code in the JsopBuilder class: {code:java} StringBuilder buff = new StringBuilder(length + 2); return buff.append('\"').append(s).append('\"').toString(); {code} can be written more simply and more efficiently like this: {code:java} return '\"' + s + '\"'; {code} https://www.baeldung.com/java-string-concatenation-invoke-dynamic was: Since invokedynamic was added to Java, this code in the JsopBuilder class: {code:java} StringBuilder buff = new StringBuilder(length + 2); return buff.append('\"').append(s).append('\"').toString(); {code} can be written more simply and more efficiently like this: {code:java} return '\"' + s + '\"'; {code:java} https://www.baeldung.com/java-string-concatenation-invoke-dynamic > Avoid creation of intermedia Stringbuffer in JsopBuilder > > > Key: OAK-10796 > URL: https://issues.apache.org/jira/browse/OAK-10796 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > Since invokedynamic was added to Java, this code in the JsopBuilder class: > {code:java} > StringBuilder buff = new StringBuilder(length + 2); > return buff.append('\"').append(s).append('\"').toString(); {code} > can be written more simply and more efficiently like this: > {code:java} > return '\"' + s + '\"'; > {code} > https://www.baeldung.com/java-string-concatenation-invoke-dynamic -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10796) Avoid creation of intermedia Stringbuffer in JsopBuilder
Nuno Santos created OAK-10796: - Summary: Avoid creation of intermedia Stringbuffer in JsopBuilder Key: OAK-10796 URL: https://issues.apache.org/jira/browse/OAK-10796 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos Since invokedynamic was added to Java, this code in the JsopBuilder class: {code:java} StringBuilder buff = new StringBuilder(length + 2); return buff.append('\"').append(s).append('\"').toString(); {code} can be written more simply and more efficiently like this: {code:java} return '\"' + s + '\"'; {code:java} https://www.baeldung.com/java-string-concatenation-invoke-dynamic -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10795) Indexing job: eliminate unnecessary intermediate object creation in transform stage
Nuno Santos created OAK-10795: - Summary: Indexing job: eliminate unnecessary intermediate object creation in transform stage Key: OAK-10795 URL: https://issues.apache.org/jira/browse/OAK-10795 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos
[jira] [Created] (OAK-10789) Indexing job: log paths used for inclusing/exclusion for Mongo regex filters in job summary
Nuno Santos created OAK-10789: - Summary: Indexing job: log paths used for inclusing/exclusion for Mongo regex filters in job summary Key: OAK-10789 URL: https://issues.apache.org/jira/browse/OAK-10789 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos The paths applied in the regex filter should be logged in the indexing report. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10788) Indexing job downloader: shutdown gracefully all threads in case of failure
Nuno Santos created OAK-10788: - Summary: Indexing job downloader: shutdown gracefully all threads in case of failure Key: OAK-10788 URL: https://issues.apache.org/jira/browse/OAK-10788 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos If the download fails, not all of the threads created by the Pipelined strategy are shut down correctly; some of them may be left behind. As they are all daemon threads, they will not prevent the JVM from shutting down. But when they are forcibly closed at JVM shutdown, they log several exceptions (connections closed abruptly, attempts to access objects that were already closed) that are confusing and distract from the root cause of the problem.
[jira] [Created] (OAK-10778) Indexing job: support parallel download from MongoDB with two connections in Pipelined strategy
Nuno Santos created OAK-10778: - Summary: Indexing job: support parallel download from MongoDB with two connections in Pipelined strategy Key: OAK-10778 URL: https://issues.apache.org/jira/browse/OAK-10778 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos The current version of the Pipelined download strategy uses a single connection/thread to download from MongoDB. We can further increase the download speed by using an additional MongoDB connection. A Mongo deployment has 1 primary and 2 secondaries, so in principle we could have 1 connection to each secondary, effectively doubling the download speed. There are a few points to observe: - Connections should go to different secondaries. If both connections go to the same secondary, there's a high chance that they will be limited by what a single replica can provide and that they will overload that replica. So each secondary should have one and only one connection. - How to partition the range of documents to download between two threads. We are already downloading from Mongo in order of {{(_modified, _id)}}. A simple and effective partition strategy for 2 connections is for one to download in ascending and the other in descending order.
[jira] [Resolved] (OAK-10682) [Indexing job] Improve Mongo regex filter to only use positive conditions (no negations)
[ https://issues.apache.org/jira/browse/OAK-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10682. --- Fix Version/s: 1.62.0 Resolution: Done > [Indexing job] Improve Mongo regex filter to only use positive conditions (no > negations) > > > Key: OAK-10682 > URL: https://issues.apache.org/jira/browse/OAK-10682 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing > Environment: The current implementation of filtering excluded paths > and custom regex is using a condition like > {noformat} > { _id: { $nin: [ /^[0-9]{1,3}:\/content\/dam\/.*$/ ] } } {noformat} > Mongo cannot evaluate this condition without retrieving the full document, > because a value of {{null}} would also match this condition and the index > does not contain {{null}} values. Therefore, when the index contains excluded > paths, the download will be much slower because Mongo has to retrieve every > single document to evaluate the condition. > As a workaround, we can transform the regex into an equivalent one that matches > the complement of the original regex using [negative > lookahead|https://stackoverflow.com/questions/1240275/how-to-negate-specific-word-in-regex]. > This allows rewriting the filter condition using only positive conditions, > which can be evaluated using only the index. >Reporter: Nuno Santos >Priority: Major > Labels: indexing > Fix For: 1.62.0
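The negative-lookahead rewrite can be tried out locally with {{java.util.regex}}. The sketch below is an illustration only: the class and method names are hypothetical, the exact regex generated by Oak is not shown in the issue, and the pattern assumes that every {{_id}} starts with the numeric depth prefix followed by a colon.

```java
import java.util.regex.Pattern;

final class PositiveExclusionFilter {
    private PositiveExclusionFilter() {}

    /**
     * Builds a positive pattern that matches every id of the form
     * "<depth>:<path>" whose path does NOT start with the excluded root.
     * Pattern.quote wraps the root in \Q...\E so it is matched literally.
     */
    static Pattern excluding(String excludedRoot) {
        return Pattern.compile("^[0-9]{1,3}:(?!" + Pattern.quote(excludedRoot) + ").*$");
    }
}
```

For example, {{excluding("/content/dam/")}} rejects {{3:/content/dam/a.jpg}} but accepts {{2:/content/site}} and {{3:/content/damage/x}}, so a single positive regex condition on {{_id}} can replace the {{$nin}} form.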
[jira] [Resolved] (OAK-10681) [indexing job] Support custom filters of paths on Mongo
[ https://issues.apache.org/jira/browse/OAK-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10681. --- Fix Version/s: 1.62.0 Resolution: Done > [indexing job] Support custom filters of paths on Mongo > --- > > Key: OAK-10681 > URL: https://issues.apache.org/jira/browse/OAK-10681 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: indexing > Fix For: 1.62.0 > > > The indexing job often has to download parts of the Oak tree that are not > needed by the indexes being built. The index definitions can define > included/excludedPaths to control which parts of the tree to index, and the > indexing job uses these to filter, on the Mongo side, the documents that > should be sent. > But often there are subtrees in Oak that are not needed for indexing but are > not excluded by the index definitions, for instance, subtrees with hidden > binary data. > This task is to add a configuration option to Oak to specify a list of > subtrees that should not be downloaded in any case, for instance: > {noformat} > customExcludedPaths = "/foo;/tmp/bar" {noformat}
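How such a setting could be interpreted is sketched below. Only the example value {{"/foo;/tmp/bar"}} comes from the issue; the semicolon-separated parsing, the subtree semantics, and the class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class CustomExcludedPaths {
    private CustomExcludedPaths() {}

    /** Splits a value such as "/foo;/tmp/bar" into its individual subtree roots. */
    static List<String> parse(String value) {
        return Arrays.stream(value.split(";"))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    /** True if the path is one of the excluded roots or lies below one of them. */
    static boolean isExcluded(String path, List<String> excludedRoots) {
        for (String root : excludedRoots) {
            // The "/" suffix keeps "/foobar" from being treated as part of "/foo".
            if (path.equals(root) || path.startsWith(root + "/")) {
                return true;
            }
        }
        return false;
    }
}
```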
[jira] [Created] (OAK-10682) [Indexing job] Improve Mongo regex filter to only use positive conditions (no negations)
Nuno Santos created OAK-10682: - Summary: [Indexing job] Improve Mongo regex filter to only use positive conditions (no negations) Key: OAK-10682 URL: https://issues.apache.org/jira/browse/OAK-10682 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Environment: The current implementation of filtering excluded paths and custom regex is using a condition like {noformat} { _id: { $nin: [ /^[0-9]{1,3}:\/content\/dam\/.*$/ ] } } {noformat} Mongo cannot evaluate this condition without retrieving the full document, because a value of {{null}} would also match this condition and the index does not contain {{null}} values. Therefore, when the index contains excluded paths, the download will be much slower because Mongo has to retrieve every single document to evaluate the condition. As a workaround, we can transform the regex into an equivalent one that matches the complement of the original regex using [negative lookahead|https://stackoverflow.com/questions/1240275/how-to-negate-specific-word-in-regex]. This allows rewriting the filter condition using only positive conditions, which can be evaluated using only the index. Reporter: Nuno Santos
[jira] [Created] (OAK-10681) [indexing job] Support custom filters of paths on Mongo
Nuno Santos created OAK-10681: - Summary: [indexing job] Support custom filters of paths on Mongo Key: OAK-10681 URL: https://issues.apache.org/jira/browse/OAK-10681 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos The indexing job often has to download parts of the Oak tree that are not needed by the indexes being built. The index definitions can define included/excludedPaths to control which parts of the tree to index, and the indexing job uses these to filter, on the Mongo side, the documents that should be sent. But often there are subtrees in Oak that are not needed for indexing but are not excluded by the index definitions, for instance, subtrees with hidden binary data. This task is to add a configuration option to Oak to specify a list of subtrees that should not be downloaded in any case, for instance: {noformat} customExcludedPaths = "/foo;/tmp/bar" {noformat}
[jira] [Resolved] (OAK-10671) [Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal
[ https://issues.apache.org/jira/browse/OAK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10671. --- Fix Version/s: 1.62.0 Resolution: Done > [Indexing job] - Improve Mongo regex query: remove condition on non-indexed > _path field to speedup traversal > > > Key: OAK-10671 > URL: https://issues.apache.org/jira/browse/OAK-10671 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > > Regex path filtering currently is implemented with a condition like: > {noformat} > _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND > _path in [^\Q/foo/bar/\E.*$]) > {noformat} > The second condition is necessary to deal with long path documents, whose > {{_id}} is a hash instead of the path of the document, and that have an > additional {{_path}} property with the full path of the document. The {{_id}} > field is part of the index used by the query, but {{_path}} is not indexed. > So the performance of this query will be very sensitive to how many times the > query condition can be resolved without having to look up the value of > {{_path}}, which requires retrieving the full document from the column > store. If the condition can be evaluated using only the {{_id}} value, then > if there is no match the document should not be retrieved from the column > store. > Unfortunately, Mongo does not seem to properly optimize this query and is > retrieving the document from the column store even when {{_id}} does not > match the path /foo/bar and the _id is not in the hash format. This leads to > very poor performance as both the index and the column store have to be fully > read by this query. > We can instead use the following condition: > {noformat} > _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$ , ^[0-9]{1,3}:h.*$] > {noformat} > That is, download the document if the _id matches the path or if it is a > hash. This has the disadvantage that it will download all long path documents > from the repository, many of which might not be needed. However, this query > condition only uses the _id field, so it is guaranteed to be evaluated fully > using only the data in the index. The number of long path documents is > usually very small, and some environments don't have any at all, so > downloading them should not take much time. The indexing job will in any case > reapply the path filter locally, to eliminate the long path documents which > are not required by the indexing job.
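The combination of an {{_id}}-only server-side condition with a local re-filter can be sketched as follows. The two regular expressions are the ones quoted in the issue for the example path /foo/bar; the class name, the method names, and the sample hash id used in the usage note are hypothetical.

```java
import java.util.regex.Pattern;

final class IdOnlyFilter {
    private IdOnlyFilter() {}

    // Matches ids that carry the path directly, e.g. "3:/foo/bar/baz".
    static final Pattern PATH_ID = Pattern.compile("^[0-9]{1,3}:\\Q/foo/bar/\\E.*$");
    // Matches long path ids, where a hash replaces the path.
    static final Pattern LONG_PATH_ID = Pattern.compile("^[0-9]{1,3}:h.*$");

    /** Server-side condition: decided from the _id alone, so the index suffices. */
    static boolean shouldDownload(String id) {
        return PATH_ID.matcher(id).matches() || LONG_PATH_ID.matcher(id).matches();
    }

    /**
     * Local re-filter applied after download. Only long path documents can have
     * slipped through wrongly, so only they are checked against their _path.
     */
    static boolean keepAfterDownload(String id, String path) {
        if (LONG_PATH_ID.matcher(id).matches()) {
            return path != null && path.startsWith("/foo/bar/");
        }
        return true;
    }
}
```

With these definitions, an id such as {{16:hdeadbeef}} (an invented example) is always downloaded, and is then kept or dropped locally depending on its {{_path}}.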
[jira] [Created] (OAK-10671) [Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal
Nuno Santos created OAK-10671: - Summary: [Indexing job] - Improve Mongo regex query: remove condition on non-indexed _path field to speedup traversal Key: OAK-10671 URL: https://issues.apache.org/jira/browse/OAK-10671 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos Regex path filtering currently is implemented with a condition like: {noformat} _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$] OR (_id in [^[0-9]{1,3}:h.*$] AND _path in [^\Q/foo/bar/\E.*$]) {noformat} The second condition is necessary to deal with long path documents, whose {{_id}} is a hash instead of the path of the document, and that have an additional {{_path}} property with the full path of the document. The {{_id}} field is part of the index used by the query, but {{_path}} is not indexed. So the performance of this query will be very sensitive to how many times the query condition can be resolved without having to look up the value of {{_path}}, which requires retrieving the full document from the column store. If the condition can be evaluated using only the {{_id}} value, then if there is no match the document should not be retrieved from the column store. Unfortunately, Mongo does not seem to properly optimize this query and is retrieving the document from the column store even when {{_id}} does not match the path /foo/bar and the _id is not in the hash format. This leads to very poor performance as both the index and the column store have to be fully read by this query. We can instead use the following condition: {noformat} _id in [^[0-9]{1,3}:\Q/foo/bar/\E.*$ , ^[0-9]{1,3}:h.*$] {noformat} That is, download the document if the _id matches the path or if it is a hash. This has the disadvantage that it will download all long path documents from the repository, many of which might not be needed. However, this query condition only uses the _id field, so it is guaranteed to be evaluated fully using only the data in the index. The number of long path documents is usually very small, and some environments don't have any at all, so downloading them should not take much time. The indexing job will in any case reapply the path filter locally, to eliminate the long path documents which are not required by the indexing job.
[jira] [Resolved] (OAK-10637) Indexing job/regex path filtering - when / is the only included path, do not add an explicit filter
[ https://issues.apache.org/jira/browse/OAK-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10637. --- Fix Version/s: 1.62.0 Resolution: Done > Indexing job/regex path filtering - when / is the only included path, do not > add an explicit filter > --- > > Key: OAK-10637 > URL: https://issues.apache.org/jira/browse/OAK-10637 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.62.0
[jira] [Resolved] (OAK-10620) Print summary at the end of the indexing job
[ https://issues.apache.org/jira/browse/OAK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10620. --- Fix Version/s: 1.62.0 Resolution: Done > Print summary at the end of the indexing job > > > Key: OAK-10620 > URL: https://issues.apache.org/jira/browse/OAK-10620 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > > This summary is intended as an easy way to copy and paste all the relevant > information from a run in order to keep a record. With this summary, it should > not be necessary to grep/search the logs just to get an overview of the job. > The summary should include: > - Coordinates of the environment > - Names of the indexes that were indexed > - Time of the different phases (download, sort, index) > - Complete configuration > - Version of the indexing job (aem-ethos-tools and Oak) > - All the metrics collected during the run of the job
[jira] [Created] (OAK-10637) Indexing job/regex path filtering - when / is the only included path, do not add an explicit filter
Nuno Santos created OAK-10637: - Summary: Indexing job/regex path filtering - when / is the only included path, do not add an explicit filter Key: OAK-10637 URL: https://issues.apache.org/jira/browse/OAK-10637 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10608) [Indexing job] Improve regex expression used to download from Mongo to make better used of Mongo indexes
[ https://issues.apache.org/jira/browse/OAK-10608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10608. --- Fix Version/s: 1.62.0 Resolution: Done > [Indexing job] Improve regex expression used to download from Mongo to make > better used of Mongo indexes > > > Key: OAK-10608 > URL: https://issues.apache.org/jira/browse/OAK-10608 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > > The current regex expression used to filter from Mongo the included/excluded > paths has conditions on both the fields {{_id}} and {{_path}}. In most > cases, the {{_id}} field contains the path of the node, but when the path is > too long, the {{_id}} is replaced by a hash of the path and the full path > is added to the document as an additional {{_path}} field. For these cases, > the regex expression must also check the {{_path}} field. > When running an ordered traversal, we use a Mongo index on {{(_modified, > _id)}}. So checks on {{_id}} can be done with just the data retrieved from > the index. But for the check on {{_path}}, Mongo needs to read the full > document from the column store, which significantly slows down the traversal. > Currently, if {{_id}} does not match, the regex expression will always check > {{_path}}, forcing a retrieval of the document. But we only need to check > {{_path}} if the {{_id}} has the form of a long path id, that is, the > pattern {{4:h...}}. Otherwise, if the {{_id}} is not a long path id and > does not match the regex, we can be sure that the document is not needed. > The check that {{_id}} is a hash can be done without retrieving the full > document from the column store, so it will be fast. And in the common case, > the document is not a long path, so this simple check will avoid retrieving > the document from the column store.
> This optimization will have a big impact when the regex expression matches a > small fraction of the repository. In the current implementation, Mongo has to > traverse both the index and the column store for all possible regex filters. > But with the additional check for long paths, Mongo still has to traverse the > full index but it will only retrieve from the column store the documents that > match the filter or the long path documents. And since the index is much > smaller than the column store and can more easily be cached, this will > significantly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
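The decision order described in OAK-10608 can be sketched in plain Java, without the Mongo driver. This is only an illustration under stated assumptions, not Oak's implementation: the class and method names are hypothetical, the two filter patterns are made-up examples, and the long-path id pattern {{^[0-9]+:h}} is inferred from the {{4:h...}} example in the description. The {{pathLoader}} supplier stands in for the expensive retrieval of the full document.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical sketch of the decision order described in OAK-10608.
final class LongPathFilterSketch {
    // Assumption: long-path ids look like "<depth>:h<hash>", as in "4:h...".
    private static final Pattern LONG_PATH_ID = Pattern.compile("^[0-9]+:h");

    /**
     * @param id         value of _id, available from the index entry
     * @param pathLoader reads _path; stands in for fetching the full document
     * @param idFilter   regex applied to _id
     * @param pathFilter regex applied to _path
     */
    static boolean isNeeded(String id, Supplier<String> pathLoader,
                            Pattern idFilter, Pattern pathFilter) {
        if (idFilter.matcher(id).find()) {
            return true;  // decided from the index entry alone
        }
        if (!LONG_PATH_ID.matcher(id).find()) {
            return false; // regular id that does not match: no document fetch
        }
        // Only long-path documents pay the cost of reading the full document.
        String path = pathLoader.get();
        return path != null && pathFilter.matcher(path).find();
    }

    public static void main(String[] args) {
        // Example filters for a hypothetical included path /content/dam.
        Pattern idFilter = Pattern.compile("^[0-9]+:/content/dam/");
        Pattern pathFilter = Pattern.compile("^/content/dam/");
        System.out.println(isNeeded("3:/content/dam/a", () -> null, idFilter, pathFilter));
        System.out.println(isNeeded("2:/var/x", () -> null, idFilter, pathFilter));
        System.out.println(isNeeded("4:habc/leaf", () -> "/content/dam/long/leaf", idFilter, pathFilter));
    }
}
```

The point of the ordering is that the supplier is never invoked for a regular id, which mirrors the claim that Mongo can reject most documents from the index alone.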
[jira] [Created] (OAK-10620) Print summary at the end of the indexing job
Nuno Santos created OAK-10620: - Summary: Print summary at the end of the indexing job Key: OAK-10620 URL: https://issues.apache.org/jira/browse/OAK-10620 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos This summary is intended to provide an easy way to copy and paste all the relevant information from a run for keeping a record. With this summary, there should be no need to grep/search the logs just to get an overview of the job. The summary should include: - Coordinates of the environment - Names of the indexes that were indexed - Time of the different phases (download, sort, index) - Complete configuration - Version of the indexing job (aem-ethos-tools and Oak) - All the metrics collected during the run of the job -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10592) [Indexing job] Add a regex filter to exclude matching entries from being downloaded from Mongo
[ https://issues.apache.org/jira/browse/OAK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10592: -- Summary: [Indexing job] Add a regex filter to exclude matching entries from being downloaded from Mongo (was: Ignore FVs nodes when downloading from Mongo) > [Indexing job] Add a regex filter to exclude matching entries from being > downloaded from Mongo > -- > > Key: OAK-10592 > URL: https://issues.apache.org/jira/browse/OAK-10592 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10592) [Indexing job] Add a regex filter to exclude matching entries from being downloaded from Mongo
[ https://issues.apache.org/jira/browse/OAK-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10592. --- Resolution: Done > [Indexing job] Add a regex filter to exclude matching entries from being > downloaded from Mongo > -- > > Key: OAK-10592 > URL: https://issues.apache.org/jira/browse/OAK-10592 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10608) [Indexing job] Improve regex expression used to download from Mongo to make better used of Mongo indexes
Nuno Santos created OAK-10608: - Summary: [Indexing job] Improve regex expression used to download from Mongo to make better used of Mongo indexes Key: OAK-10608 URL: https://issues.apache.org/jira/browse/OAK-10608 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos The current regex expression used to filter from Mongo the included/excluded paths has conditions on both the fields {{_id}} and {{_path}}. In most cases, the {{_id}} field contains the path of the node, but when the path is too long, the {{_id}} is replaced by a hash of the path and the full path is added to the document as an additional {{_path}} field. For these cases, the regex expression must also check the {{_path}} field. When running an ordered traversal, we use a Mongo index on {{(_modified, _id)}}. So checks on {{_id}} can be done with just the data retrieved from the index. But for the check on {{_path}}, Mongo needs to read the full document from the column store, which significantly slows down the traversal. Currently, if {{_id}} does not match, the regex expression will always check {{_path}}, forcing a retrieval of the document. But we only need to check {{_path}} if the {{_id}} has the form of a long path id, that is, the pattern {{4:h...}}. Otherwise, if the {{_id}} is not a long path id and does not match the regex, we can be sure that the document is not needed. The check that {{_id}} is a hash can be done without retrieving the full document from the column store, so it will be fast. And in the common case, the document is not a long path, so this simple check will avoid retrieving the document from the column store. This optimization will have a big impact when the regex expression matches a small fraction of the repository. In the current implementation, Mongo has to traverse both the index and the column store for all possible regex filters.
But with the additional check for long paths, Mongo still has to traverse the full index but it will only retrieve from the column store the documents that match the filter or the long path documents. And since the index is much smaller than the column store and can more easily be cached, this will significantly improve performance. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10589) Improve regex path filtering to also handle cases where excludedPaths are defined
[ https://issues.apache.org/jira/browse/OAK-10589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10589. --- Fix Version/s: 1.62.0 Resolution: Done > Improve regex path filtering to also handle cases where excludedPaths are > defined > - > > Key: OAK-10589 > URL: https://issues.apache.org/jira/browse/OAK-10589 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > > Currently, we apply regex path filtering in the following case: > * *includedPaths is non-empty and excludedPaths is empty* - use a filter on > the Mongo query with every includedPath. > But we can apply path filtering on Mongo in more situations: > * *{{includedPaths}} empty, {{excludedPaths}} non-empty* - This is the > reverse situation of what we currently support, so we can define a Mongo > filter with the list of {{excludedPaths}} and negate it. > * *both includedPaths and excludedPaths are non-empty* - In this case we > can simply ignore the excluded paths and download all included paths. If an > excluded path is outside an included path, it will not be downloaded because > it will not match the included path filters. If an excluded path is a > descendant of an included path, it will be downloaded from Mongo but filtered > in the transform stage before being written to the FlatFileStore. > * *includedPaths and excludedPaths are both empty* - In this case we fall > back to downloading everything. -- This message was sent by Atlassian Jira (v8.20.10#820010)
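The four cases listed in OAK-10589 reduce to a small decision function. The sketch below is a hypothetical illustration in Java: the enum, class and method names are invented for this example and do not correspond to Oak classes.

```java
import java.util.List;

// Hypothetical sketch of the four cases listed in OAK-10589.
final class PathFilterStrategySketch {
    enum MongoFilter {
        INCLUDE_ONLY,     // filter on includedPaths, ignore excludedPaths in Mongo
        NEGATED_EXCLUDE,  // filter on excludedPaths and negate it
        NONE              // no filter: download everything
    }

    static MongoFilter choose(List<String> includedPaths, List<String> excludedPaths) {
        if (!includedPaths.isEmpty()) {
            // Covers "included only" and "included + excluded": excluded
            // descendants of an included path are still downloaded and are
            // dropped later, in the transform stage.
            return MongoFilter.INCLUDE_ONLY;
        }
        if (!excludedPaths.isEmpty()) {
            return MongoFilter.NEGATED_EXCLUDE;
        }
        return MongoFilter.NONE;
    }

    public static void main(String[] args) {
        System.out.println(choose(List.of("/content"), List.of()));
        System.out.println(choose(List.of(), List.of("/var")));
        System.out.println(choose(List.of("/content"), List.of("/content/tmp")));
        System.out.println(choose(List.of(), List.of()));
    }
}
```

Checking {{includedPaths}} first keeps the logic simple: once any included path exists, the excluded paths no longer influence the Mongo query.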
[jira] [Created] (OAK-10592) Ignore FVs nodes when downloading from Mongo
Nuno Santos created OAK-10592: - Summary: Ignore FVs nodes when downloading from Mongo Key: OAK-10592 URL: https://issues.apache.org/jira/browse/OAK-10592 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10590) Indexing job downloads and creates FFS with full node store if includedPaths is specified as a string instead of array of strings
[ https://issues.apache.org/jira/browse/OAK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10590. --- Fix Version/s: 1.62.0 Resolution: Done > Indexing job downloads and creates FFS with full node store if includedPaths > is specified as a string instead of array of strings > - > > Key: OAK-10590 > URL: https://issues.apache.org/jira/browse/OAK-10590 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > > The {{includedPaths}} property of an index definition should be an array of > strings. > If it is instead specified as a String, like in this example: > {noformat} > "includedPaths": "/a/b", {noformat} > the indexing job defaults to using {{/}} as the value for includedPaths, > and therefore downloads the full node store and creates an FFS containing > everything except the hidden paths. The logic that handles this case is here: > [https://github.com/apache/jackrabbit-oak/blob/0b8f4ab2e736c6561ae745a5fe6040a59911eeb3/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/filter/PathFilter.java#L95-L103] > This will significantly slow down the indexing, as it will negate any > benefits from using regex filtering. And even if regex filtering is not > enabled or cannot be used, using / as includedPaths will also result in the > FFS containing more nodes than it should, which will once again slow down the > indexing job. > Suggested fix: if includedPaths is a String, treat it as a one-element array > and at the same time log a warning. > Additionally, apply the same fix to other properties in the index definition: > * {{excludedPaths}} > * {{includedPaths}} > * {{queryPaths}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
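The fix suggested in OAK-10590 (accept a single String as a one-element array and log a warning) could look roughly like the following. This is a minimal sketch with hypothetical names that works on plain Java objects rather than Oak's property API, and it writes the warning to standard error instead of using a logger.

```java
import java.util.List;

// Hypothetical sketch of the fix suggested in OAK-10590.
final class PathPropertySketch {
    /**
     * Normalizes the raw value of a path property such as includedPaths.
     * A missing value yields an empty list; the caller decides the default.
     */
    static List<String> asPathList(String propertyName, Object value) {
        if (value == null) {
            return List.of();
        }
        if (value instanceof String[]) {
            return List.of((String[]) value);
        }
        if (value instanceof String) {
            // Tolerate the misconfiguration, but make it visible.
            System.err.println("WARN: property " + propertyName
                    + " should be an array of strings, got a single string: " + value);
            return List.of((String) value);
        }
        throw new IllegalArgumentException(
                "Unsupported type for " + propertyName + ": " + value.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(asPathList("includedPaths", "/a/b"));
        System.out.println(asPathList("includedPaths", new String[] {"/a/b", "/c"}));
    }
}
```

With this normalization, the example value {{"/a/b"}} is treated as the single included path {{/a/b}} rather than silently widening the scope to {{/}}.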
[jira] [Created] (OAK-10590) Indexing job downloads and creates FFS with full node store if includedPaths is specified as a string instead of array of strings
Nuno Santos created OAK-10590: - Summary: Indexing job downloads and creates FFS with full node store if includedPaths is specified as a string instead of array of strings Key: OAK-10590 URL: https://issues.apache.org/jira/browse/OAK-10590 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos The {{includedPaths}} property of an index definition should be an array of strings. If it is instead specified as a String, like in this example: {noformat} "includedPaths": "/a/b", {noformat} the indexing job defaults to using {{/}} as the value for includedPaths, and therefore downloads the full node store and creates an FFS containing everything except the hidden paths. The logic that handles this case is here: [https://github.com/apache/jackrabbit-oak/blob/0b8f4ab2e736c6561ae745a5fe6040a59911eeb3/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/filter/PathFilter.java#L95-L103] This will significantly slow down the indexing, as it will negate any benefits from using regex filtering. And even if regex filtering is not enabled or cannot be used, using / as includedPaths will also result in the FFS containing more nodes than it should, which will once again slow down the indexing job. Suggested fix: if includedPaths is a String, treat it as a one-element array and at the same time log a warning. Additionally, apply the same fix to other properties in the index definition: * {{excludedPaths}} * {{includedPaths}} * {{queryPaths}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10571) Names of metrics exported by indexing logic are inconsistent
[ https://issues.apache.org/jira/browse/OAK-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10571. --- Fix Version/s: 1.62.0 Resolution: Done > Names of metrics exported by indexing logic are inconsistent > > > Key: OAK-10571 > URL: https://issues.apache.org/jira/browse/OAK-10571 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.62.0 > > > - Some metrics are fully in snake case, while others are a mix of camel and > snake case. > - The metrics that represent times should all end in {{duration_seconds}} > - The duration of the Mongo dump phase is missing > {noformat} > oak_indexer_full_index_creation_duration_seconds > oak_indexer_indexing_duration_seconds > oak_indexer_merge_node_store_duration_seconds > oak_indexer_pipelined_documentsAccepted > oak_indexer_pipelined_documentsDownloaded > oak_indexer_pipelined_documentsRejected > oak_indexer_pipelined_documentsRejectedEmptyNodeState > oak_indexer_pipelined_documentsRejectedSplit > oak_indexer_pipelined_documentsTraversed > oak_indexer_pipelined_entriesAccepted > oak_indexer_pipelined_entriesRejected > oak_indexer_pipelined_entriesRejectedHiddenPaths > oak_indexer_pipelined_entriesRejectedPathFiltered > oak_indexer_pipelined_entriesTraversed > oak_indexer_pipelined_extractedEntriesTotalSize > oak_indexer_pipelined_mergeSortEagerMergesRuns > oak_indexer_pipelined_mergeSortFinalMergeFilesCount > oak_indexer_pipelined_mergeSortFinalMergeTime > oak_indexer_pipelined_mergeSortIntermediateFilesCount > oak_indexer_pipelined_mongoDownloadEnqueueDelayPercentage > oak_indexer_import_bring_index_uptodate_duration_seconds > oak_indexer_import_import_index_data_duration_seconds > oak_indexer_import_release_checkpoint_duration_seconds > oak_indexer_import_switch_lane_duration_seconds > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
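To illustrate the first point of OAK-10571, the mixed camel/snake case names can be normalized mechanically. The helper below is a hypothetical sketch; the names it produces are illustrative only and are not necessarily the names adopted in Oak. It does not add the {{duration_seconds}} suffix and does not treat runs of capitals (acronyms) specially.

```java
// Hypothetical sketch: normalize mixed camel/snake case metric names.
final class MetricNameSketch {
    static String toSnakeCase(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                // Start a new word unless one was already started by '_'.
                if (i > 0 && name.charAt(i - 1) != '_') {
                    sb.append('_');
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSnakeCase("oak_indexer_pipelined_documentsAccepted"));
        System.out.println(toSnakeCase("oak_indexer_indexing_duration_seconds"));
    }
}
```

Names that are already in snake case pass through unchanged, so the helper can be applied uniformly to the whole list.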
[jira] [Created] (OAK-10589) Improve regex path filtering to also handle cases where excludedPaths are defined
Nuno Santos created OAK-10589: - Summary: Improve regex path filtering to also handle cases where excludedPaths are defined Key: OAK-10589 URL: https://issues.apache.org/jira/browse/OAK-10589 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos Currently, we apply regex path filtering in the following case: * *includedPaths is non-empty and excludedPaths is empty* - use a filter on the Mongo query with every includedPath. But we can apply path filtering on Mongo in more situations: * *{{includedPaths}} empty, {{excludedPaths}} non-empty* - This is the reverse situation of what we currently support, so we can define a Mongo filter with the list of {{excludedPaths}} and negate it. * *both includedPaths and excludedPaths are non-empty* - In this case we can simply ignore the excluded paths and download all included paths. If an excluded path is outside an included path, it will not be downloaded because it will not match the included path filters. If an excluded path is a descendant of an included path, it will be downloaded from Mongo but filtered in the transform stage before being written to the FlatFileStore. * *includedPaths and excludedPaths are both empty* - In this case we fall back to downloading everything. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10580) Indexing job: improve regex path filtering, support multiple includedPaths
[ https://issues.apache.org/jira/browse/OAK-10580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10580. --- Fix Version/s: 1.62.0 Resolution: Done > Indexing job: improve regex path filtering, support multiple includedPaths > -- > > Key: OAK-10580 > URL: https://issues.apache.org/jira/browse/OAK-10580 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.62.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10438) Remove MULTTHREADED_TRAVERSE_WITH_SORT download strategy
[ https://issues.apache.org/jira/browse/OAK-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10438: -- Summary: Remove MULTTHREADED_TRAVERSE_WITH_SORT download strategy (was: Remove deprecated download strategies) > Remove MULTTHREADED_TRAVERSE_WITH_SORT download strategy > > > Key: OAK-10438 > URL: https://issues.apache.org/jira/browse/OAK-10438 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: Indexing > Fix For: 1.62.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10438) Remove deprecated download strategies
[ https://issues.apache.org/jira/browse/OAK-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10438. --- Fix Version/s: 1.62.0 Resolution: Done > Remove deprecated download strategies > - > > Key: OAK-10438 > URL: https://issues.apache.org/jira/browse/OAK-10438 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: Indexing > Fix For: 1.62.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10580) Indexing job: improve regex path filtering, support multiple includedPaths
Nuno Santos created OAK-10580: - Summary: Indexing job: improve regex path filtering, support multiple includedPaths Key: OAK-10580 URL: https://issues.apache.org/jira/browse/OAK-10580 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10571) Names of metrics exported by indexing logic are inconsistent
Nuno Santos created OAK-10571: - Summary: Names of metrics exported by indexing logic are inconsistent Key: OAK-10571 URL: https://issues.apache.org/jira/browse/OAK-10571 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos - Some metrics are fully in snake case, while others are a mix of camel and snake case. - The metrics that represent times should all end in {{duration_seconds}} - The duration of the Mongo dump phase is missing {noformat} oak_indexer_full_index_creation_duration_seconds oak_indexer_indexing_duration_seconds oak_indexer_merge_node_store_duration_seconds oak_indexer_pipelined_documentsAccepted oak_indexer_pipelined_documentsDownloaded oak_indexer_pipelined_documentsRejected oak_indexer_pipelined_documentsRejectedEmptyNodeState oak_indexer_pipelined_documentsRejectedSplit oak_indexer_pipelined_documentsTraversed oak_indexer_pipelined_entriesAccepted oak_indexer_pipelined_entriesRejected oak_indexer_pipelined_entriesRejectedHiddenPaths oak_indexer_pipelined_entriesRejectedPathFiltered oak_indexer_pipelined_entriesTraversed oak_indexer_pipelined_extractedEntriesTotalSize oak_indexer_pipelined_mergeSortEagerMergesRuns oak_indexer_pipelined_mergeSortFinalMergeFilesCount oak_indexer_pipelined_mergeSortFinalMergeTime oak_indexer_pipelined_mergeSortIntermediateFilesCount oak_indexer_pipelined_mongoDownloadEnqueueDelayPercentage oak_indexer_import_bring_index_uptodate_duration_seconds oak_indexer_import_import_index_data_duration_seconds oak_indexer_import_release_checkpoint_duration_seconds oak_indexer_import_switch_lane_duration_seconds {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OAK-10541) Pipelined strategy: improve memory management of transform stage
[ https://issues.apache.org/jira/browse/OAK-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17789364#comment-17789364 ] Nuno Santos commented on OAK-10541: --- Now that the version increase was rolled back, can we resolve this issue? > Pipelined strategy: improve memory management of transform stage > > > Key: OAK-10541 > URL: https://issues.apache.org/jira/browse/OAK-10541 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Assignee: Julian Reschke >Priority: Major > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10541) Pipelined strategy: improve memory management of transform stage
[ https://issues.apache.org/jira/browse/OAK-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10541. --- Fix Version/s: 1.60.0 Resolution: Done > Pipelined strategy: improve memory management of transform stage > > > Key: OAK-10541 > URL: https://issues.apache.org/jira/browse/OAK-10541 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10519) Export metrics from indexing job
[ https://issues.apache.org/jira/browse/OAK-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10519. --- Fix Version/s: 1.60.0 Resolution: Done > Export metrics from indexing job > > > Key: OAK-10519 > URL: https://issues.apache.org/jira/browse/OAK-10519 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10554) LastRevRecoveryRandomizedIT test seems to be flaky
Nuno Santos created OAK-10554: - Summary: LastRevRecoveryRandomizedIT test seems to be flaky Key: OAK-10554 URL: https://issues.apache.org/jira/browse/OAK-10554 Project: Jackrabbit Oak Issue Type: Bug Reporter: Nuno Santos The failure below was observed in a CI run that was preceded and followed by other runs where the test did not fail, apparently without any change to the code that could have affected this test. {noformat} 14:53:23 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.247 s <<< FAILURE! - in org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryRandomizedIT 14:53:23 [ERROR] randomized(org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryRandomizedIT) Time elapsed: 1.247 s <<< ERROR! 14:53:23 org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: Configured cluster node id 1 already in use: needs recovery and was unable to perform it myself 14:53:23 at org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo.createInstance(ClusterNodeInfo.java:629) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo.getInstance(ClusterNodeInfo.java:471) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.<init>(DocumentNodeStore.java:607) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBuilder.build(DocumentNodeStoreBuilder.java:176) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.DocumentMK$Builder.getNodeStore(DocumentMK.java:481) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryRandomizedIT.checkStore(LastRevRecoveryRandomizedIT.java:262) 14:53:23 at org.apache.jackrabbit.oak.plugins.document.LastRevRecoveryRandomizedIT.randomized(LastRevRecoveryRandomizedIT.java:133) 14:53:23 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10538) Pipeline strategy: eliminate unnecessary intermediate copy of entries in transform stage
[ https://issues.apache.org/jira/browse/OAK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10538. --- Fix Version/s: 1.60.0 Resolution: Done > Pipeline strategy: eliminate unnecessary intermediate copy of entries in > transform stage > > > Key: OAK-10538 > URL: https://issues.apache.org/jira/browse/OAK-10538 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10547) Indexing job fails at the end of reindexing if it took more than 24h to run
[ https://issues.apache.org/jira/browse/OAK-10547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10547. --- Fix Version/s: 1.60.0 Resolution: Done > Indexing job fails at the end of reindexing if it took more than 24h to run > --- > > Key: OAK-10547 > URL: https://issues.apache.org/jira/browse/OAK-10547 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.60.0 > > > {noformat} > 10:21:04.989 [main] ERROR com.adobe.granite.indexing.tool.Main - Can't > perform operation > java.time.DateTimeException: Invalid value for SecondOfDay (valid values 0 - > 86399): 98646 > at > java.base/java.time.temporal.ValueRange.checkValidValue(ValueRange.java:311) > at > java.base/java.time.temporal.ChronoField.checkValidValue(ChronoField.java:717) > at java.base/java.time.LocalTime.ofSecondOfDay(LocalTime.java:380) > at > org.apache.jackrabbit.oak.plugins.index.FormattingUtils.formatToSeconds(FormattingUtils.java:27) > at > org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.reindex(DocumentStoreIndexerBase.java:303) > at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:195) > at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:134) > at com.adobe.granite.indexing.tool.Main.main(Main.java:112) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
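The stack trace in OAK-10547 shows {{LocalTime.ofSecondOfDay}} rejecting 98646 seconds, because a {{LocalTime}} can only represent values below 24 hours (0 to 86399 seconds). One way to format arbitrary durations is plain integer arithmetic, as in the sketch below. The class name is hypothetical, and this is not necessarily the fix that was applied in Oak's {{FormattingUtils}}.

```java
import java.util.Locale;

// Hypothetical sketch: format a duration in seconds as HH:mm:ss without
// java.time.LocalTime, so that durations of 24 hours or more are supported.
final class DurationFormatSketch {
    static String formatToSeconds(long seconds) {
        long abs = Math.abs(seconds);
        long hours = abs / 3600;          // may exceed 23
        long minutes = (abs % 3600) / 60;
        long secs = abs % 60;
        String formatted = String.format(Locale.ROOT, "%02d:%02d:%02d", hours, minutes, secs);
        return seconds < 0 ? "-" + formatted : formatted;
    }

    public static void main(String[] args) {
        // 98646 is the value from the stack trace in OAK-10547.
        System.out.println(formatToSeconds(98646));
    }
}
```

For the value in the stack trace, 98646 seconds is 27 hours, 24 minutes and 6 seconds, so the hours field simply grows past 23 instead of triggering a {{DateTimeException}}.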
[jira] [Updated] (OAK-10547) Indexing job fails at the end of reindexing if it took more than 24h to run
[ https://issues.apache.org/jira/browse/OAK-10547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10547: -- Description: {noformat} 10:21:04.989 [main] ERROR com.adobe.granite.indexing.tool.Main - Can't perform operation java.time.DateTimeException: Invalid value for SecondOfDay (valid values 0 - 86399): 98646 at java.base/java.time.temporal.ValueRange.checkValidValue(ValueRange.java:311) at java.base/java.time.temporal.ChronoField.checkValidValue(ChronoField.java:717) at java.base/java.time.LocalTime.ofSecondOfDay(LocalTime.java:380) at org.apache.jackrabbit.oak.plugins.index.FormattingUtils.formatToSeconds(FormattingUtils.java:27) at org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.reindex(DocumentStoreIndexerBase.java:303) at com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:195) at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:134) at com.adobe.granite.indexing.tool.Main.main(Main.java:112) {noformat} > Indexing job fails at the end of reindexing if it took more than 24h to run > --- > > Key: OAK-10547 > URL: https://issues.apache.org/jira/browse/OAK-10547 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Major > > {noformat} > 10:21:04.989 [main] ERROR com.adobe.granite.indexing.tool.Main - Can't > perform operation > java.time.DateTimeException: Invalid value for SecondOfDay (valid values 0 - > 86399): 98646 > at > java.base/java.time.temporal.ValueRange.checkValidValue(ValueRange.java:311) > at > java.base/java.time.temporal.ChronoField.checkValidValue(ChronoField.java:717) > at java.base/java.time.LocalTime.ofSecondOfDay(LocalTime.java:380) > at > org.apache.jackrabbit.oak.plugins.index.FormattingUtils.formatToSeconds(FormattingUtils.java:27) > at > org.apache.jackrabbit.oak.index.indexer.document.DocumentStoreIndexerBase.reindex(DocumentStoreIndexerBase.java:303) > at 
com.adobe.granite.indexing.tool.ReindexCmd.index(ReindexCmd.java:195) > at com.adobe.granite.indexing.tool.ReindexCmd.run(ReindexCmd.java:134) > at com.adobe.granite.indexing.tool.Main.main(Main.java:112) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10547) Indexing job fails at the end of reindexing if it took more than 24h to run
Nuno Santos created OAK-10547: - Summary: Indexing job fails at the end of reindexing if it took more than 24h to run Key: OAK-10547 URL: https://issues.apache.org/jira/browse/OAK-10547 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10541) Pipelined strategy: improve memory management of transform stage
Nuno Santos created OAK-10541: - Summary: Pipelined strategy: improve memory management of transform stage Key: OAK-10541 URL: https://issues.apache.org/jira/browse/OAK-10541 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10538) Pipeline strategy: eliminate unnecessary intermediate copy of entries in transform stage
Nuno Santos created OAK-10538: - Summary: Pipeline strategy: eliminate unnecessary intermediate copy of entries in transform stage Key: OAK-10538 URL: https://issues.apache.org/jira/browse/OAK-10538 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10460) PIPELINED strategy fails with OOME during final merge phase for very large repositories
[ https://issues.apache.org/jira/browse/OAK-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10460. --- Fix Version/s: 1.60.0 Resolution: Done > PIPELINED strategy fails with OOME during final merge phase for very large > repositories > --- > > Key: OAK-10460 > URL: https://issues.apache.org/jira/browse/OAK-10460 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10519) Export metrics from indexing job
Nuno Santos created OAK-10519: - Summary: Export metrics from indexing job Key: OAK-10519 URL: https://issues.apache.org/jira/browse/OAK-10519 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10504) Add indexing job total duration log message
[ https://issues.apache.org/jira/browse/OAK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10504. --- Fix Version/s: 1.60.0 Resolution: Done > Add indexing job total duration log message > --- > > Key: OAK-10504 > URL: https://issues.apache.org/jira/browse/OAK-10504 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10437) Deprecate all download strategies except PIPELINED
[ https://issues.apache.org/jira/browse/OAK-10437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10437. --- Fix Version/s: 1.60.0 Resolution: Done > Deprecate all download strategies except PIPELINED > -- > > Key: OAK-10437 > URL: https://issues.apache.org/jira/browse/OAK-10437 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: Indexing > Fix For: 1.60.0 > > > Deprecate these strategies: STORE_AND_SORT, TRAVERSE_WITH_SORT, > MULTITHREADED_TRAVERSE_WITH_SORT. > > When they are used, print a log message saying clearly that they are > deprecated for removal and suggest using PIPELINED. -- This message was sent by Atlassian Jira (v8.20.10#820010)
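A minimal sketch of the deprecation notice described in OAK-10437, for illustration only. The strategy names come from the issue; the enum `DownloadStrategy`, the method `deprecationWarning()` and the message wording are hypothetical and are not taken from the Oak code base. A caller would pass the returned message to whatever logger it uses.

```java
import java.util.Optional;

// Hypothetical enum: the constant names are from the issue, the rest is an assumption.
enum DownloadStrategy {
    PIPELINED(false),
    STORE_AND_SORT(true),
    TRAVERSE_WITH_SORT(true),
    MULTITHREADED_TRAVERSE_WITH_SORT(true);

    private final boolean deprecated;

    DownloadStrategy(boolean deprecated) {
        this.deprecated = deprecated;
    }

    // Returns a warning to log when a deprecated strategy is selected,
    // or an empty Optional for a strategy that is not deprecated.
    Optional<String> deprecationWarning() {
        if (!deprecated) {
            return Optional.empty();
        }
        return Optional.of("Download strategy " + name()
                + " is deprecated for removal; use " + PIPELINED.name() + " instead.");
    }
}
```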
[jira] [Resolved] (OAK-10505) Make PIPELINED the default download strategy in the indexing job
[ https://issues.apache.org/jira/browse/OAK-10505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10505. --- Fix Version/s: 1.60.0 Resolution: Done > Make PIPELINED the default download strategy in the indexing job > > > Key: OAK-10505 > URL: https://issues.apache.org/jira/browse/OAK-10505 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.60.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10505) Make PIPELINED the default download strategy in the indexing job
Nuno Santos created OAK-10505: - Summary: Make PIPELINED the default download strategy in the indexing job Key: OAK-10505 URL: https://issues.apache.org/jira/browse/OAK-10505 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10504) Add indexing job total duration log message
[ https://issues.apache.org/jira/browse/OAK-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10504: -- Priority: Minor (was: Major) > Add indexing job total duration log message > --- > > Key: OAK-10504 > URL: https://issues.apache.org/jira/browse/OAK-10504 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10504) Add indexing job total duration log message
Nuno Santos created OAK-10504: - Summary: Add indexing job total duration log message Key: OAK-10504 URL: https://issues.apache.org/jira/browse/OAK-10504 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10491) Indexing: pass a MongoDatabase instance instead of MongoConnection to indexing logic
[ https://issues.apache.org/jira/browse/OAK-10491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10491. --- Fix Version/s: 1.60.0 Resolution: Done > Indexing: pass a MongoDatabase instance instead of MongoConnection to > indexing logic > > > Key: OAK-10491 > URL: https://issues.apache.org/jira/browse/OAK-10491 > Project: Jackrabbit Oak > Issue Type: Improvement >Reporter: Nuno Santos >Priority: Major > Fix For: 1.60.0 > > > The pipelined indexing strategy needs access to a MongoDatabase object > to register a custom codec that deserializes the responses from Mongo. > Previously we were passing a MongoConnection object, which contains a > reference to MongoDatabase. However, the indexing job does not need any > field of MongoConnection other than MongoDatabase, and requiring > MongoConnection makes it harder for users of Oak to call this API. > We can simplify the logic by requiring only a MongoDatabase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10491) Indexing: pass a MongoDatabase instance instead of MongoConnection to indexing logic
Nuno Santos created OAK-10491: - Summary: Indexing: pass a MongoDatabase instance instead of MongoConnection to indexing logic Key: OAK-10491 URL: https://issues.apache.org/jira/browse/OAK-10491 Project: Jackrabbit Oak Issue Type: Improvement Reporter: Nuno Santos The pipelined indexing strategy needs access to a MongoDatabase object to register a custom codec that deserializes the responses from Mongo. Previously we were passing a MongoConnection object, which contains a reference to MongoDatabase. However, the indexing job does not need any field of MongoConnection other than MongoDatabase, and requiring MongoConnection makes it harder for users of Oak to call this API. We can simplify the logic by requiring only a MongoDatabase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10475) Expose the mongo connection in MongoDocumentNodeStoreBuilderBase
[ https://issues.apache.org/jira/browse/OAK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10475. --- Fix Version/s: 1.58.0 Resolution: Done > Expose the mongo connection in MongoDocumentNodeStoreBuilderBase > > > Key: OAK-10475 > URL: https://issues.apache.org/jira/browse/OAK-10475 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.58.0 > > > This is a follow up of OAK-10453, which changed the indexing job to use > custom codecs when downloading from Mongo. For this, the indexing logic needs > access to the Mongo Connection to register the codec. If a library client > uses MongoDocumentNodeStoreBuilderBase to build a MongoDocumentStore > instance, then it may also need access to the Mongo connection to pass it to > the indexing logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10475) Expose the mongo connection in MongoDocumentNodeStoreBuilderBase
[ https://issues.apache.org/jira/browse/OAK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10475: -- Description: This is a follow up of OAK-10453, which changed the indexing job to use custom codecs when downloading from Mongo. For this, the indexing logic needs access to the Mongo Connection to register the codec. If a client uses the MongoDocumentStore (was: This is a follow up of ) > Expose the mongo connection in MongoDocumentNodeStoreBuilderBase > > > Key: OAK-10475 > URL: https://issues.apache.org/jira/browse/OAK-10475 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > This is a follow up of OAK-10453, which changed the indexing job to use > custom codecs when downloading from Mongo. For this, the indexing logic needs > access to the Mongo Connection to register the codec. If a client uses the > MongoDocumentStore -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10475) Expose the mongo connection in MongoDocumentNodeStoreBuilderBase
[ https://issues.apache.org/jira/browse/OAK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10475: -- Description: This is a follow up of OAK-10453, which changed the indexing job to use custom codecs when downloading from Mongo. For this, the indexing logic needs access to the Mongo Connection to register the codec. If a library client uses MongoDocumentNodeStoreBuilderBase to build a MongoDocumentStore instance, then it may also need access to the Mongo connection to pass it to the indexing logic. (was: This is a follow up of OAK-10453, which changed the indexing job to use custom codecs when downloading from Mongo. For this, the indexing logic needs access to the Mongo Connection to register the codec. If a client uses the MongoDocumentStore) > Expose the mongo connection in MongoDocumentNodeStoreBuilderBase > > > Key: OAK-10475 > URL: https://issues.apache.org/jira/browse/OAK-10475 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > This is a follow up of OAK-10453, which changed the indexing job to use > custom codecs when downloading from Mongo. For this, the indexing logic needs > access to the Mongo Connection to register the codec. If a library client > uses MongoDocumentNodeStoreBuilderBase to build a MongoDocumentStore > instance, then it may also need access to the Mongo connection to pass it to > the indexing logic. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10475) Expose the mongo connection in MongoDocumentNodeStoreBuilderBase
[ https://issues.apache.org/jira/browse/OAK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10475: -- Description: This is a follow up of > Expose the mongo connection in MongoDocumentNodeStoreBuilderBase > > > Key: OAK-10475 > URL: https://issues.apache.org/jira/browse/OAK-10475 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > This is a follow up of -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10475) Expose the mongo connection in MongoDocumentNodeStoreBuilderBase
Nuno Santos created OAK-10475: - Summary: Expose the mongo connection in MongoDocumentNodeStoreBuilderBase Key: OAK-10475 URL: https://issues.apache.org/jira/browse/OAK-10475 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10458) Indexing job: Make LZ4 the default compression algorithm in OAK
[ https://issues.apache.org/jira/browse/OAK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10458. --- Fix Version/s: 1.58.0 Resolution: Done > Indexing job: Make LZ4 the default compression algorithm in OAK > --- > > Key: OAK-10458 > URL: https://issues.apache.org/jira/browse/OAK-10458 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.58.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10453) Pipelined strategy: enforce size limit on memory taken by objects in the queue between download and transform thread
[ https://issues.apache.org/jira/browse/OAK-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10453. --- Fix Version/s: 1.58.0 Resolution: Done > Pipelined strategy: enforce size limit on memory taken by objects in the > queue between download and transform thread > > > Key: OAK-10453 > URL: https://issues.apache.org/jira/browse/OAK-10453 > Project: Jackrabbit Oak > Issue Type: Bug > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.58.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10458) Indexing job: Make LZ4 the default compression algorithm in OAK
[ https://issues.apache.org/jira/browse/OAK-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10458. --- Fix Version/s: 1.58.0 Resolution: Fixed > Indexing job: Make LZ4 the default compression algorithm in OAK > --- > > Key: OAK-10458 > URL: https://issues.apache.org/jira/browse/OAK-10458 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: Indexing > Fix For: 1.58.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10452) Indexing job/regex filtering: getting ancestors nodes of filtered path incorrectly does a full col scan on Mongo
[ https://issues.apache.org/jira/browse/OAK-10452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10452. --- Fix Version/s: 1.58.0 Resolution: Fixed > Indexing job/regex filtering: getting ancestors nodes of filtered path > incorrectly does a full col scan on Mongo > > > Key: OAK-10452 > URL: https://issues.apache.org/jira/browse/OAK-10452 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.58.0 > > > In the PIPELINED strategy of the indexing job, when regex path filtering is > enabled, the job runs two queries against Mongo: > * Download the ancestors of the base path (e.g., {{0:/}}, {{1:/p1}}, > {{2:/p1/p2}}). > * Download all the children of the base path (e.g., {{???:/p1/p2/*}}). > The first query returns only a few results, so it should use the index on > {{_id}}. However, to deal with the rare case where the path is long > and the {{_id}} field is actually a hash instead of the path, the query for > the ancestors also searches for matches on the {{_path}} field, which > is set if {{_id}} is a hash. The issue is that {{_path}} is not > indexed, so the first query falls back to a full collection scan, which is much slower > than an index scan for the handful of ancestors. This negates most or even > all of the gains of using regex filtering. -- This message was sent by Atlassian Jira (v8.20.10#820010)
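The ancestor lookup in OAK-10452 can be illustrated with a short sketch. It builds the depth-prefixed ids shown in the issue's example (`0:/`, `1:/p1`, `2:/p1/p2`) for a base path; the class and method names are hypothetical and are not Oak APIs, and the sketch assumes an absolute path without a trailing slash and short enough that `_id` holds the path rather than a hash. Because the result is a small list of exact ids, a lookup restricted to `_id` can be served by the `_id` index instead of a collection scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Oak.
final class AncestorIds {
    private AncestorIds() {
    }

    // Returns "<depth>:<path>" for the root, every intermediate ancestor,
    // and the base path itself, in order of increasing depth.
    static List<String> ancestorIds(String basePath) {
        List<String> ids = new ArrayList<>();
        ids.add("0:/");
        if (basePath.equals("/")) {
            return ids;
        }
        int depth = 0;
        int idx = 0;
        while (idx != -1) {
            // Find the end of the next path segment.
            idx = basePath.indexOf('/', idx + 1);
            depth++;
            String ancestor = (idx == -1) ? basePath : basePath.substring(0, idx);
            ids.add(depth + ":" + ancestor);
        }
        return ids;
    }
}
```

For the base path `/p1/p2` this yields the three ids listed in the issue.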
[jira] [Created] (OAK-10460) PIPELINED strategy fails with OOME during final merge phase for very large repositories
Nuno Santos created OAK-10460: - Summary: PIPELINED strategy fails with OOME during final merge phase for very large repositories Key: OAK-10460 URL: https://issues.apache.org/jira/browse/OAK-10460 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10458) Indexing job: Make LZ4 the default compression algorithm in OAK
Nuno Santos created OAK-10458: - Summary: Indexing job: Make LZ4 the default compression algorithm in OAK Key: OAK-10458 URL: https://issues.apache.org/jira/browse/OAK-10458 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10453) Pipelined strategy: enforce size limit on memory taken by objects in the queue between download and transform thread
Nuno Santos created OAK-10453: - Summary: Pipelined strategy: enforce size limit on memory taken by objects in the queue between download and transform thread Key: OAK-10453 URL: https://issues.apache.org/jira/browse/OAK-10453 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10452) Indexing job/regex filtering: getting ancestors nodes of filtered path incorrectly does a full col scan on Mongo
Nuno Santos created OAK-10452: - Summary: Indexing job/regex filtering: getting ancestors nodes of filtered path incorrectly does a full col scan on Mongo Key: OAK-10452 URL: https://issues.apache.org/jira/browse/OAK-10452 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos In the PIPELINED strategy of the indexing job, when regex path filtering is enabled, the job runs two queries against Mongo: * Download the ancestors of the base path (e.g., {{0:/}}, {{1:/p1}}, {{2:/p1/p2}}). * Download all the children of the base path (e.g., {{???:/p1/p2/*}}). The first query returns only a few results, so it should use the index on {{_id}}. However, to deal with the rare case where the path is long and the {{_id}} field is actually a hash instead of the path, the query for the ancestors also searches for matches on the {{_path}} field, which is set if {{_id}} is a hash. The issue is that {{_path}} is not indexed, so the first query falls back to a full collection scan, which is much slower than an index scan for the handful of ancestors. This negates most or even all of the gains of using regex filtering. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10423) Improve logging of metrics in indexing job
[ https://issues.apache.org/jira/browse/OAK-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10423. --- Fix Version/s: 1.58.0 Resolution: Done > Improve logging of metrics in indexing job > -- > > Key: OAK-10423 > URL: https://issues.apache.org/jira/browse/OAK-10423 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: indexing > Fix For: 1.58.0 > > > Improvements: > - Report all times in {{hours:minutes:seconds}}. Currently we > rely on Guava's {{Stopwatch.toString()}}, which reports times as > {{hh.[00-99]}} instead of using the standard 0 to 59 minutes. But if a > duration is less than 1h, it reports the number of minutes. This is > inconsistent and confusing, so we should always use the standard time format. > - Report the time for index upload to Lucene. > The MT traverse strategy does not report times for some phases, but as we > are phasing out this strategy, it is not important to improve metrics > reporting for it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
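A minimal sketch of the `hours:minutes:seconds` format requested in OAK-10423, using only `java.time.Duration`. The class `DurationFormat` and the method `formatHms` are hypothetical names chosen for illustration and are not taken from the Oak code base; the sketch assumes a non-negative duration and does not wrap the hours field at 24, so a run longer than a day still prints as a single hours value.

```java
import java.time.Duration;

// Hypothetical helper, not part of Oak.
final class DurationFormat {
    private DurationFormat() {
    }

    // Formats a non-negative duration as hh:mm:ss, with minutes and seconds
    // in the range 00-59 and hours growing past 24 when needed.
    static String formatHms(Duration duration) {
        long totalSeconds = duration.getSeconds();
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        return String.format("%02d:%02d:%02d", hours, minutes, seconds);
    }
}
```

With this format a 45-minute phase and a 25-hour phase are reported the same way, which avoids the unit switch the issue describes.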
[jira] [Created] (OAK-10438) Remove deprecated download strategies
Nuno Santos created OAK-10438: - Summary: Remove deprecated download strategies Key: OAK-10438 URL: https://issues.apache.org/jira/browse/OAK-10438 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10437) Deprecate all download strategies except PIPELINED
Nuno Santos created OAK-10437: - Summary: Deprecate all download strategies except PIPELINED Key: OAK-10437 URL: https://issues.apache.org/jira/browse/OAK-10437 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos Deprecate these strategies: STORE_AND_SORT, TRAVERSE_WITH_SORT, MULTITHREADED_TRAVERSE_WITH_SORT. When they are used, print a log message saying clearly that they are deprecated for removal and suggest using PIPELINED. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10358) Indexing job: push filtering of paths to MongoDB
[ https://issues.apache.org/jira/browse/OAK-10358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10358. --- Fix Version/s: 1.58.0 Resolution: Done > Indexing job: push filtering of paths to MongoDB > > > Key: OAK-10358 > URL: https://issues.apache.org/jira/browse/OAK-10358 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > Fix For: 1.58.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10423) Improve logging of metrics in indexing job
[ https://issues.apache.org/jira/browse/OAK-10423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10423: -- Description: Improvements: - Report all times in {{hours:minutes:seconds}}. Currently we rely on Guava's {{Stopwatch.toString()}}, which reports times as {{hh.[00-99]}} instead of using the standard 0 to 59 minutes. But if a duration is less than 1h, it reports the number of minutes. This is inconsistent and confusing, so we should always use the standard time format. - Report the time for index upload to Lucene. The MT traverse strategy does not report times for some phases, but as we are phasing out this strategy, it is not important to improve metrics reporting for it. > Improve logging of metrics in indexing job > -- > > Key: OAK-10423 > URL: https://issues.apache.org/jira/browse/OAK-10423 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Labels: indexing > > Improvements: > - Report all times in {{hours:minutes:seconds}}. Currently we > rely on Guava's {{Stopwatch.toString()}}, which reports times as > {{hh.[00-99]}} instead of using the standard 0 to 59 minutes. But if a > duration is less than 1h, it reports the number of minutes. This is > inconsistent and confusing, so we should always use the standard time format. > - Report the time for index upload to Lucene. > The MT traverse strategy does not report times for some phases, but as we > are phasing out this strategy, it is not important to improve metrics > reporting for it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10423) Improve logging of metrics in indexing job
Nuno Santos created OAK-10423: - Summary: Improve logging of metrics in indexing job Key: OAK-10423 URL: https://issues.apache.org/jira/browse/OAK-10423 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10375) Binary data in logs related to the haystack property
Nuno Santos created OAK-10375: - Summary: Binary data in logs related to the haystack property Key: OAK-10375 URL: https://issues.apache.org/jira/browse/OAK-10375 Project: Jackrabbit Oak Issue Type: Bug Components: indexing Reporter: Nuno Santos When indexing documents with the {{haystack0}} property, some log messages contain the binary data of the property. In the log below, I replaced the binary data by {{{}{}}}, but it is usually very long. {noformat} 16:30:40.107 [main] ERROR o.a.j.o.p.i.l.LuceneDocumentMaker - could not index similarity field for property haystack0 = and definition PropertyDefinition\{name='jcr:content/metadata/imageFeatures/haystack0', propertyType=0, boost=1.0, isRegexp=false, index=true, stored=false, nodeScopeIndex=true, propertyIndex=true, analyzed=false, ordered=false, useInSuggest=false, useInSimilarity=true, nullCheckEnabled=false, notNullCheckEnabled=false, function=null} {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10356) Adjust lower and upper bounds of auto-detected memory limits in PipelinedStrategy
[ https://issues.apache.org/jira/browse/OAK-10356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10356. --- Resolution: Done > Adjust lower and upper bounds of auto-detected memory limits in > PipelinedStrategy > - > > Key: OAK-10356 > URL: https://issues.apache.org/jira/browse/OAK-10356 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10358) Indexing job: push filtering of paths to MongoDB
Nuno Santos created OAK-10358: - Summary: Indexing job: push filtering of paths to MongoDB Key: OAK-10358 URL: https://issues.apache.org/jira/browse/OAK-10358 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10356) Adjust lower and upper bounds of auto-detected memory limits in PipelinedStrategy
Nuno Santos created OAK-10356: - Summary: Adjust lower and upper bounds of auto-detected memory limits in PipelinedStrategy Key: OAK-10356 URL: https://issues.apache.org/jira/browse/OAK-10356 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (OAK-10350) Update spring-boot dependency to version 2.7.13
[ https://issues.apache.org/jira/browse/OAK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744153#comment-17744153 ] Nuno Santos commented on OAK-10350: --- Probably because I am not a committer. I answered in the PR. > Update spring-boot dependency to version 2.7.13 > --- > > Key: OAK-10350 > URL: https://issues.apache.org/jira/browse/OAK-10350 > Project: Jackrabbit Oak > Issue Type: Task > Components: standalone >Reporter: Manfred Baedke >Assignee: Manfred Baedke >Priority: Minor > Fix For: 1.56.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10294) Indexing job: add new Pipelined Strategy for dumping Mongo contents in preparation for reindexing
[ https://issues.apache.org/jira/browse/OAK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10294. --- Resolution: Done > Indexing job: add new Pipelined Strategy for dumping Mongo contents in > preparation for reindexing > - > > Key: OAK-10294 > URL: https://issues.apache.org/jira/browse/OAK-10294 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10294) Indexing job: add new Pipelined Strategy for dumping Mongo contents in preparation for reindexing
[ https://issues.apache.org/jira/browse/OAK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10294: -- Summary: Indexing job: add new Pipelined Strategy for dumping Mongo contents in preparation for reindexing (was: Pipelined Mongo dump for indexing job) > Indexing job: add new Pipelined Strategy for dumping Mongo contents in > preparation for reindexing > - > > Key: OAK-10294 > URL: https://issues.apache.org/jira/browse/OAK-10294 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10294) Pipelined Mongo dump for indexing job
[ https://issues.apache.org/jira/browse/OAK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10294: -- Summary: Pipelined Mongo dump for indexing job (was: Pipelined Mongo dump for indexing job (WIP - Do Not Merge)) > Pipelined Mongo dump for indexing job > - > > Key: OAK-10294 > URL: https://issues.apache.org/jira/browse/OAK-10294 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10294) Pipelined Mongo dump for indexing job (WIP - Do Not Merge)
Nuno Santos created OAK-10294: - Summary: Pipelined Mongo dump for indexing job (WIP - Do Not Merge) Key: OAK-10294 URL: https://issues.apache.org/jira/browse/OAK-10294 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10005) Bump Mockito from 3.12.4 to the latest 4.x release
[ https://issues.apache.org/jira/browse/OAK-10005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10005. --- Resolution: Duplicate > Bump Mockito from 3.12.4 to the latest 4.x release > -- > > Key: OAK-10005 > URL: https://issues.apache.org/jira/browse/OAK-10005 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > > Mockito 3.12.4 does not support running with Java 19. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (OAK-10004) Bump Elasticsearch Java client from 7.17.6 to 7.17.7
[ https://issues.apache.org/jira/browse/OAK-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos resolved OAK-10004. --- Fix Version/s: 1.46.0 Resolution: Done > Bump Elasticsearch Java client from 7.17.6 to 7.17.7 > > > Key: OAK-10004 > URL: https://issues.apache.org/jira/browse/OAK-10004 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > Fix For: 1.46.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10005) Bump Mockito from 3.12.4 to the latest 4.x release
Nuno Santos created OAK-10005: - Summary: Bump Mockito from 3.12.4 to the latest 4.x release Key: OAK-10005 URL: https://issues.apache.org/jira/browse/OAK-10005 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos Mockito 3.12.4 does not support running with Java 19. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10004) Bump Elasticsearch clients from 7.17.6 to 7.17.7
[ https://issues.apache.org/jira/browse/OAK-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10004: -- Summary: Bump Elasticsearch clients from 7.17.6 to 7.17.7 (was: Bump Elasticsearch clients from 7.17.3 to 7.17.6) > Bump Elasticsearch clients from 7.17.6 to 7.17.7 > > > Key: OAK-10004 > URL: https://issues.apache.org/jira/browse/OAK-10004 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-10004) Bump Elasticsearch Java client from 7.17.6 to 7.17.7
[ https://issues.apache.org/jira/browse/OAK-10004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-10004: -- Summary: Bump Elasticsearch Java client from 7.17.6 to 7.17.7 (was: Bump Elasticsearch clients from 7.17.6 to 7.17.7) > Bump Elasticsearch Java client from 7.17.6 to 7.17.7 > > > Key: OAK-10004 > URL: https://issues.apache.org/jira/browse/OAK-10004 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-10004) Bump Elasticsearch clients from 7.17.3 to 7.17.6
Nuno Santos created OAK-10004: - Summary: Bump Elasticsearch clients from 7.17.3 to 7.17.6 Key: OAK-10004 URL: https://issues.apache.org/jira/browse/OAK-10004 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (OAK-9965) Add support for running unit tests against Elasticsearch 8.4.3
[ https://issues.apache.org/jira/browse/OAK-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nuno Santos updated OAK-9965: - Summary: Add support for running unit tests against Elasticsearch 8.4.3 (was: Add support for testing with Elastiknn 8.4.3) > Add support for running unit tests against Elasticsearch 8.4.3 > -- > > Key: OAK-9965 > URL: https://issues.apache.org/jira/browse/OAK-9965 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: indexing >Reporter: Nuno Santos >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (OAK-9965) Add support for testing with Elastiknn 8.4.3
Nuno Santos created OAK-9965: Summary: Add support for testing with Elastiknn 8.4.3 Key: OAK-9965 URL: https://issues.apache.org/jira/browse/OAK-9965 Project: Jackrabbit Oak Issue Type: Improvement Components: indexing Reporter: Nuno Santos -- This message was sent by Atlassian Jira (v8.20.10#820010)