[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581731#comment-15581731 ] Ian Boston commented on OAK-3547: - Using Lucene directly to manage generations of the segments file is covered in OAK-4943 > Improve ability of the OakDirectory to recover from unexpected file errors > -- > > Key: OAK-3547 > URL: https://issues.apache.org/jira/browse/OAK-3547 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Affects Versions: 1.4 >Reporter: Ian Boston > > Currently if the OakDirectory finds that a file is missing or in some way > damaged, and exception is thrown which impacts all queries using that index, > at times making the index unavailable. This improvement aims to make the > OakDirectory recover to a previously ok state by storing which files were > involved in previous states, and giving the code some way of checking if they > are valid. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551486#comment-15551486 ]

Ian Boston commented on OAK-3547:

The original patch was written so long ago that I don't think IndexCopier was present, or at least the deployment the patch was targeting did not have writeOnCopy etc. enabled, or possibly even available.

The implementation of OakIndexFile is suboptimal for Lucene usage: it loads chunks of the index into memory as byte[] to perform a seek, whereas FSDirectory uses OS-level native code to seek. Hence it makes no sense to use OakDirectory any more. FSDirectory should be used by whatever means necessary. It might be an idea to delete or deprecate OakDirectory, so that it is not used for opening Lucene indexes.

The patch is in a state where it should not be applied or used. It can't efficiently determine corruption without direct access to the underlying file, which is abstracted by Oak. With the benefit of hindsight, the patch should be in IndexCopier, to prevent a bad segments.gen file from failing the index.

We should close this issue as the patch isn't valid any more.

> Improve ability of the OakDirectory to recover from unexpected file errors
> --
>
> Key: OAK-3547
> URL: https://issues.apache.org/jira/browse/OAK-3547
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: lucene
> Affects Versions: 1.4
> Reporter: Ian Boston
> Fix For: 1.6
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551315#comment-15551315 ]

Chetan Mehrotra commented on OAK-3547:

List of NodeState using generational approach [1]

{noformat}
Base state - No file added
===
/{saveDirectoryListing = true}
:data
:dir
l_1475740652371{state = }
l_1475740652360{state = }
:data
===
3 files added
===
/{saveDirectoryListing = true}
:data
:dir
l_1475740652371{state = }
l_1475740652394{state = foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;}
l_1475740652360{state = }
:data
foo2{blobSize = 1047552, jcr:lastModified = 1475740652394, jcr:data = [1 binaries], uniqueKey = 84dba956e1634dc14812db9a7ae6a81c}
foo1{blobSize = 1047552, jcr:lastModified = 1475740652393, jcr:data = [1 binaries], uniqueKey = a9505647e35a7f0e0365adbc2531c91f}
foo0{blobSize = 1047552, jcr:lastModified = 1475740652392, jcr:data = [1 binaries], uniqueKey = 937a0057a90df312f452aceb96f40c8c}
===
3 'bar' files added
===
/{saveDirectoryListing = true}
:data
:dir
l_1475740652371{state = }
l_1475740652407{state = foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;bar2,bar2,656,2ea3ee49550e65f45d2e8d706e1a3bcef2d4a8b3;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;bar1,bar1,953,a296f858ccb93414d6e99ca7a1997fb711faa65d;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;bar0,bar0,433,83ad0b2e68f114aa901c9eee0715c4507cd5dfe1;}
l_1475740652394{state = foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;}
l_1475740652360{state = }
:data
foo2{blobSize = 1047552, jcr:lastModified = 1475740652394, jcr:data = [1 binaries], uniqueKey = 84dba956e1634dc14812db9a7ae6a81c}
foo1{blobSize = 1047552, jcr:lastModified = 1475740652393, jcr:data = [1 binaries], uniqueKey = a9505647e35a7f0e0365adbc2531c91f}
foo0{blobSize = 1047552, jcr:lastModified = 1475740652392, jcr:data = [1 binaries], uniqueKey = 937a0057a90df312f452aceb96f40c8c}
bar1{blobSize = 1047552, jcr:lastModified = 1475740652406, jcr:data = [1 binaries], uniqueKey = 5ff21364ce67199cddd14833f0614d73}
bar2{blobSize = 1047552, jcr:lastModified = 1475740652406, jcr:data = [1 binaries], uniqueKey = c10e5f46e90e01d927f0aee82980ac6d}
bar0{blobSize = 1047552, jcr:lastModified = 1475740652405, jcr:data = [1 binaries], uniqueKey = 0f32d06997a4fec9a90060b863cda454}
===
{noformat}

* An empty {{:data}} node is created under {{:data}}. Is this required, or is it possibly due to the bug below? In line #2 the same builder should be used.
{code}
private SimpleDirectoryListing(@Nonnull IndexDefinition definition, @Nonnull NodeBuilder builder) {
    this.definition = definition;
    this.directoryBuilder = getOrCreateChild(builder, INDEX_DATA_CHILD_NAME);
    this.fileNames.addAll(getListing());
}
{code}
* load and save are called every time, even if no change has been made. This adds empty l_ddd nodes and should be avoided.
* LISTING_STATE_PROPERTY -
** It holds the encoded listing info stored as a single string property. Hopefully this does not grow very large if the directory content is large. Not sure of typical sizes.
** You could possibly use a MultivalueProperty here.
* Not sure about the snippet below in the {{load}} method.
{code}
if (loaded >= 0 ) {
    return (loaded != childNodes.size() - 1);
}
{code}
* The {{doGC}} and {{sync}} methods do not have any test coverage.

*Feature Flag*

It would be more comforting if this feature were driven by a feature flag, so that the generation logic is used if enabled, otherwise it defaults to {{GenerationalDirectoryListing}}. We can expose a setting in {{LuceneIndexProviderService}} to lock the new feature so as to enable controlled testing.

I think having a flag to enable the feature when it is not yet being used would be easy. What would be tricky is disabling it once it has been enabled ... that aspect can possibly be ignored.

h5. Effect of corruption on writes

Currently, if corruption occurs the system automatically falls back to an older version. This is fine for reads, but for writes it would mean data loss unless the content is reindexed, as the async indexer would only index newer content. We have 2 options here:
# Let the async indexer continue, but provide some indication that the index is corrupt and that a reindex is required in some time. This needs to be highlighted in a prominent way (periodic logs, JMX etc).
# Let the fallback be used for reads (readOnly == true) but let it fail for writes.

Possibly this needs to be exposed as conf
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529963#comment-15529963 ]

Ian Boston commented on OAK-3547:

Inspection of all the metrics captured indicates that the method [1] is the main cause of the differences in the times taken to open, sync and close the directory index, as each operation must generate a fresh SHA1 from all the index files.

If it were possible to rely on some other mechanism for checking the integrity of each file, this checksum could be replaced with something much simpler, like the file length, which would avoid generating a SHA1 on each operation. This would then rely on Oak Lucene managing the recovery, rather than on the Oak Directory listing being self-healing. To achieve this, a drop() method on the OakDirectory might be required to drop the current generation of the listing on demand.

[1] https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547#diff-28ec89220db72ab858b9eb25927c2a29R1026
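The trade-off between a full SHA1 check and a length-only check can be illustrated with a minimal sketch. This is not the patch's code: {{FileCheck}}, {{sha1Hex}}, {{matchesSha1}} and {{matchesLength}} are hypothetical names, and the sketch works on in-memory byte arrays rather than on Oak index files.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only; not part of Oak or of the patch.
final class FileCheck {
    private FileCheck() {}

    /** Full check: every byte is read, so the cost grows with the file size. */
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Robust: detects any change to the content. */
    static boolean matchesSha1(byte[] content, String expectedHex) {
        return sha1Hex(content).equals(expectedHex);
    }

    /** Cheap: no content is read, but same-length damage goes unnoticed. */
    static boolean matchesLength(long actualLength, long expectedLength) {
        return actualLength == expectedLength;
    }
}
```

A file whose bytes are damaged without changing its length passes {{matchesLength}} and fails {{matchesSha1}}, which is why the cheaper check would need recovery to be handled elsewhere.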
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529581#comment-15529581 ]

Ian Boston commented on OAK-3547:

I have tested the patch in Sling and AEM 6.2 + Oak 1.6; no functional regressions or errors were seen.

There is an indication that opening the index is significantly slower with the patch because, in order to verify all the files in the index, the files are read and a SHA1 is generated from them to ensure that they are not damaged. Instrumenting the OakDirectory constructor call, which in the generational version validates the contents of the directory before opening, reveals the following.

Without OAK-3547 patch
{code}
t          count max       mean     min      stddev   p50      p75      p95      p98     p99     p999      mean_rate m1_rate m5_rate  m15_rate rate_unit    duration_unit
1475066389 40    10.221184 0.148839 0.022663 0.882145 0.072211 0.072211 0.078419 0.15831 0.15831 10.221184 0.07476   0.00111 0.227486 0.687796 calls/second milliseconds
{code}

With OAK-3547 patch
{code}
t          count max        mean      min      stddev     p50        p75        p95        p98        p99        p999       mean_rate m1_rate  m5_rate  m15_rate rate_unit    duration_unit
1475063657 40    571.811075 378.76475 0.125733 209.479519 492.429012 492.429012 492.429012 571.811075 571.811075 571.811075 0.082329  0.002645 0.269424 0.727275 calls/second milliseconds
{code}

How large the difference is will depend on the size of the index files. The patch may also move the IO read operation on the index from outside the OakDirectory constructor to inside it, so these readings may or may not be significant. If they prove to be significant, then the SHA1 check on files could be dropped from every directory open and some other check performed instead. Other checks won't be as robust as a full SHA1 check.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516665#comment-15516665 ]

Ian Boston commented on OAK-3547:

There is now unit test coverage to validate that the correct previous generation of the directory will be opened in the event the underlying files are lost or damaged.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513803#comment-15513803 ]

Ian Boston commented on OAK-3547:

The build passes up to LDAP.
{code}
[INFO] Oak HTTP Binding ... SUCCESS [ 3.334 s]
[INFO] Oak Lucene . SUCCESS [04:26 min]
[INFO] Oak Solr core .. SUCCESS [ 53.040 s]
[INFO] Oak Solr OSGi .. SUCCESS [ 44.762 s]
[INFO] Oak External Authentication Support SUCCESS [ 58.348 s]
[INFO] Oak LDAP Authentication Support FAILURE [27:47 min]
[INFO] Oak TarMK Standby .. SKIPPED
[INFO] Oak Remote API . SKIPPED
[INFO] Oak CUG Authorization .. SKIPPED
{code}
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513619#comment-15513619 ]

Ian Boston commented on OAK-3547:

The branch at [1] has been updated to work with trunk at r1761930 and passes all unit tests in the bundle build. I am running a full build now to verify that there are no regressions.

[1] https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547

cc: [~chetanm]
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993557#comment-14993557 ]

Ian Boston commented on OAK-3547:

With the latest commit to the branch, segments.gen is now stored in segments.gen_ and is immutable, making it possible to correctly open previous generations.

I think the patch is complete, unless a more sophisticated recovery mechanism is required (i.e. flagging the index as requiring a rebuild, without actually doing it). It obviously needs extensive testing to see what happens when real repo corruption happens in a live cluster. At present it has only been tested with a single instance on MongoMK.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991734#comment-14991734 ]

Ian Boston commented on OAK-3547:

segments.gen is opened by Lucene as well as segments_xx, which means that it is both mutable and used. To allow previous generations to be used, the name will need to be transformed inside the listing, if that is possible.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991349#comment-14991349 ]

Ian Boston commented on OAK-3547:

[~mreutegg] If an earlier version of the index is used by the writer, there will be holes in the index and items will be missing. There are several options.
a) Flag the issue to alert admins that the index is not healthy, but continue to index using an index that will open.
b) Fail the index write and stop indexing completely.
c) Fail the index write and start re-indexing automatically.

Of those, I think option a will deliver the best continuity. Option b risks wide-scale application-level issues; option c risks both application-level issues and potential unavailability caused by the load of rebuilding an index from scratch. There is no easy answer.

Now that there are checksums in place I have been seeing more frequent race conditions between the writer and the readers, which occasionally open older versions. I think this is because the OakDirectory checks all the files when it is opened, by computing a checksum of everything referenced. I think that Lucene delays checking a file, or the internals of a file, until it is needed, hence any errors are more visible than before.

Lucene already has a concept of committing the index by syncing the segments_xx and segments.gen files. I am writing the listing node on sync of either of these, or on close of the index, which has reduced the number of generations. The result appears to be very stable.

I have also introduced the concept of mutability, as some of the file types are mutable. .del is mutable, so its length and checksum are not checked. If a .del from a later generation is used, that will only delete the Lucene docs that were deleted in that later generation. No damage.

segments.gen is also mutable. This is more of a problem. It is supposed to be a fallback file, with segments_xx used in preference; however, if segments.gen is used it will be from the wrong generation and will define the wrong set of segment files for the index. I need to check if segments.gen is ever read. If it is, then I think the OakDirectory needs to map segments.gen to a generational version of the same (i.e. segments.gen_) so that only .del files are mutable. That should make the OakDirectory recoverable.
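The mutability rule described in this comment can be sketched as follows. This is an illustration under assumptions, not the patch's code: {{MutableFiles}}, {{isMutable}} and {{entryAccepted}} are hypothetical names, and only the file length is compared here to keep the example short.

```java
// Hypothetical helper, for illustration only; not part of Oak or of the patch.
final class MutableFiles {
    private MutableFiles() {}

    /**
     * Files the comment describes as mutable at this stage of the patch:
     * Lucene delete files (.del) and segments.gen.
     */
    static boolean isMutable(String fileName) {
        return fileName.endsWith(".del") || fileName.equals("segments.gen");
    }

    /**
     * A listing entry is accepted if the file is mutable (its recorded length
     * is not compared) or if the recorded length still matches.
     */
    static boolean entryAccepted(String fileName, long recordedLength, long actualLength) {
        return isMutable(fileName) || recordedLength == actualLength;
    }
}
```

If segments.gen were mapped to a generational name as proposed, the second condition in {{isMutable}} would no longer be needed and only .del files would skip validation.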
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989655#comment-14989655 ]

Chetan Mehrotra commented on OAK-3547:

I think this approach should only affect the flow on the query side. If the index gets corrupted we should let AsyncIndexUpdate fail. The purpose of this feature is to avoid immediate downtime.

Another option would be to mark that index as disabled, so that it does not block the indexing cycle, and set its reindex flag to true. Then either it gets reindexed automatically in the next cycle, or we expose some JMX operation so that an admin can determine when the reindexing is performed.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989642#comment-14989642 ]

Marcel Reutegger commented on OAK-3547:

I was referring to {{NodeStore.checkpoint()}} used by the {{AsyncIndexUpdate}}. The index update uses those checkpoints to determine the changes that need to be indexed. Right now the checkpoint is released after the Lucene index has been updated. If we revert to an earlier version of the Lucene index, don't we miss changes, because the next index update will be based on the current checkpoint?
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989600#comment-14989600 ]

Ian Boston commented on OAK-3547:

[~mreutegg] Currently every call to OakDirectory.sync(...) and OakDirectory.close(...), where the OakDirectory is not a read-only Oak directory, causes a list of files with size and SHA1 hash to be written to a new node with a name of the form /oak:index//:state/l_. When the current Oak session commits, that is committed to the Oak repo.

When the OakDirectory is loaded, it tries up to 100 l_ nodes in order, newest first, checking that the contents are present and have a matching length and SHA1. The first valid listing found is loaded. If no valid listing is found, the code reverts to the earlier behaviour, using all the non-deleted files in the /oak:index//:dir folder. If the bundle is deployed to an existing repository it will fall back to the old behaviour.

I have assumed that a call to either OakDirectory.sync(...) or OakDirectory.close(...) indicates a checkpoint of the Lucene indexing process.
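The newest-first selection described above can be sketched roughly as below. This is a simplified illustration, not the patch's code: {{ListingSelector}} and its methods are hypothetical names, the listing names are assumed to be {{l_}} followed by a numeric timestamp (as in the log output elsewhere in this thread), and the length and SHA1 validation is abstracted behind a {{Predicate}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only; not part of Oak or of the patch.
final class ListingSelector {
    private ListingSelector() {}

    /** Assumes names of the form "l_" followed by a numeric timestamp. */
    static long timestampOf(String listingName) {
        return Long.parseLong(listingName.substring(2));
    }

    /**
     * Tries at most maxAttempts listings, newest first, and returns the first
     * one the validator accepts. An empty result means the caller should fall
     * back to the earlier behaviour of using all non-deleted files.
     */
    static Optional<String> selectNewestValid(List<String> listingNames,
                                              Predicate<String> isValid,
                                              int maxAttempts) {
        List<String> newestFirst = new ArrayList<>(listingNames);
        newestFirst.sort((a, b) -> Long.compare(timestampOf(b), timestampOf(a)));
        int attempts = 0;
        for (String name : newestFirst) {
            if (attempts++ >= maxAttempts) {
                break;
            }
            if (isValid.test(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }
}
```

With a limit of 100, a damaged newest listing is skipped and the next most recent valid listing is used instead.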
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989491#comment-14989491 ]

Marcel Reutegger commented on OAK-3547:

IIUC the lucene index will fall back to a previous version in case it faces an inconsistent state. Is there also some coordination with the checkpoint associated with the previous known good state of the lucene index?
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989329#comment-14989329 ]

Ian Boston commented on OAK-3547:

Attempts to break the index operation have produced recovery behaviour that appears stable.

{code}
04.11.2015 10:02:20.391 *INFO* [aysnc-index-update-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Saving Listing state to l_1446631340351 as _3.si,229,1a1c2c541a3cb5087d2a3c60ae77b6b05461410d;segments_3,117,a94bece6be54b2690a3b11925ed9bbc80e914d5d;segments_4,117,a74de5ae54af2abbb2423573cd0ad93b950ec18a;_2.cfe,224,dd2d758773e57e172933ff3d3fc3a4908af59dc4;_0_1.del,36,906f2506ff277e28716cc19eb8b55f289e34c53c;_0.si,229,57e0616d14993ad2a9680f55d4151d440cad8255;_2.si,229,ceddd18aa9343666e78f6330d0e261e96474717b;segments.gen,20,395c2b9ba7f05f4debb52b0a7cea8ac56ad671a2;_3.cfe,224,2596860e7bcdd550e221488708afda8729689107;_0.cfs,1431868,a7368bb6e2398a5952eddbc062a498f100a29865;_0.cfe,266,62127a60fe2224e32e3720eb15b2bd9f34d4670a;_3.cfs,1210,1c7764b0713716c53d1b0ed21c0063c5606aad49;_2.cfs,1188,1cd8b738619350e6998b77c2142d5a3748e861f1;
04.11.2015 10:02:20.391 *INFO* [aysnc-index-update-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Saving due to close.
04.11.2015 10:02:20.427 *INFO* [aysnc-index-update-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Saving Listing state to l_1446631340391 as _3.si,229,1a1c2c541a3cb5087d2a3c60ae77b6b05461410d;segments_4,117,a74de5ae54af2abbb2423573cd0ad93b950ec18a;_0_1.del,36,906f2506ff277e28716cc19eb8b55f289e34c53c;_0.si,229,57e0616d14993ad2a9680f55d4151d440cad8255;segments.gen,20,395c2b9ba7f05f4debb52b0a7cea8ac56ad671a2;_3.cfe,224,2596860e7bcdd550e221488708afda8729689107;_0.cfs,1431868,a7368bb6e2398a5952eddbc062a498f100a29865;_0.cfe,266,62127a60fe2224e32e3720eb15b2bd9f34d4670a;_3.cfs,1210,1c7764b0713716c53d1b0ed21c0063c5606aad49;
04.11.2015 10:02:23.375 *WARN* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing IO Exception reading index file
    at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.getIndexFileMetaData(OakDirectory.java:965)
    at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.validateListing(OakDirectory.java:844)
    at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.load(OakDirectory.java:878)
    at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.(OakDirectory.java:750)
    at org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.(OakDirectory.java:728)
04.11.2015 10:02:23.377 *WARN* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Index File and Oak Version and not the same Name: _xi.fdt, length:14816174, checksum:ff821a1bde2330f1389c782a47677206c685 CheckSum: ff821a1bde2330f1389c782a47677206c685 != Unable to generate checksum java.lang.RuntimeException: failed to read block from backend, id b535214bddc090c74a426acaeeb5654140c1be52d4af824f2b759113c8a7bdc6@0,
04.11.2015 10:02:23.377 *WARN* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Rejected directory listing l_1446631317105
04.11.2015 10:02:24.104 *INFO* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Accepted directory listing l_1446631315966, using
04.11.2015 10:02:24.129 *INFO* [oak-lucene-0] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Accepted directory listing l_1446631317105, using
04.11.2015 10:02:25.034 *INFO* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Saving Listing state to l_1446631344105 as _xi.tim,7775213,f8e401ff1a95bba1c387a7d00239a9b10c4323ba;_xi.si,326,fcb94171f92573a5fd365b178dda96728c16da50;_xi_3.del,38,703e0294067b659a0f15f8114659b20de3d51385;_1oa.si,235,ddb8cb303fdf4372cc333a980a1755d25d64d6cb;_xi.pos,2748906,154e1ba4078865bc81f35646e36dea906e54b539;_xi.nvd,159060,abacf7b5c17963a24c9f54715ac40c1e2dd85f0b;_xi.fdt,14816174,ff821a1bde2330f1389c782a47677206c685;_xj.cfe,224,1403e26ddc7e3d005758f3dae6c8bbf50e9a4313;_1o9.cfe,224,f7926636965fdd76c2622d57f3e24a217f230a44;_xi.nvm,46,d09f4ec10424aac4b5a2fe1da422f266aace8bca;_xj.si,232,f119a8cab06c10a0e34d87b92e16bf1b28a688f3;_xi.fdx,1272306,0c62a780b2f3c62af3014bc530d79d8144a8f014;segments.gen,20,7b46fe18999a02b4247cbcd8222034d4a2c9291c;segments_1o6,157,21acb7510ff37ef18452d622d8fcf2
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989147#comment-14989147 ]
Ian Boston commented on OAK-3547:

No issues seen after running for 12h. Need to look into improving the recovery capabilities. Currently any change to any file referenced in a listing causes the listing to be rejected. Some Lucene index files are mutable (the delete file), so checking must be relaxed for those files, since they will change. Need to analyse the changes in the log files to check what really is mutable and what is immutable in a listing.
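The relaxed check described above could look roughly like the following. This is only a sketch under stated assumptions: the class, the `FileMeta` type and the rule that files ending in `.del` are the mutable ones are all hypothetical, and the comment itself says the real set of mutable files still has to be worked out from the logs.

```java
import java.util.Map;

/** Hypothetical sketch, not code from the patch: listing validation with a relaxed rule for mutable files. */
final class RelaxedListingCheckSketch {

    /** Length and checksum as recorded in a listing, or as observed on the current file. */
    static final class FileMeta {
        final long length;
        final String checksum;

        FileMeta(long length, String checksum) {
            this.length = length;
            this.checksum = checksum;
        }
    }

    /** Assumption: only delete files are treated as mutable; the real set still needs analysis. */
    static boolean isMutable(String name) {
        return name.endsWith(".del");
    }

    /** A file passes if it exists and, when immutable, still has the recorded length and checksum. */
    static boolean matches(String name, FileMeta recorded, FileMeta actual) {
        if (actual == null) {
            return false; // a missing file always fails, mutable or not
        }
        if (isMutable(name)) {
            return true; // relaxed check: existence is enough
        }
        return recorded.length == actual.length && recorded.checksum.equals(actual.checksum);
    }

    /** A listing is accepted only if every file it references passes. */
    static boolean validate(Map<String, FileMeta> recorded, Map<String, FileMeta> actual) {
        for (Map.Entry<String, FileMeta> e : recorded.entrySet()) {
            if (!matches(e.getKey(), e.getValue(), actual.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule a listing survives a rewritten delete file but is still rejected when an immutable file changes length or checksum, or when any referenced file is missing.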
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987582#comment-14987582 ]
Ian Boston commented on OAK-3547:

Added GC of index files and listing files. At present the code keeps a maximum of 100 directory listing files and won't delete any index files referenced in those directory listing files. When there are 10 or more directory listing files to delete, they are deleted and the index files are GC'd. The number of directory listing files kept can be changed. The check happens every time a non-read-only OakDirectory is opened; it's quick to perform. Code just pushed, testing using AEM6.1 on MongoMK overnight.
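A minimal sketch of the GC rule described above, not the code that was pushed. It assumes listings are held in memory, keyed by the timestamp in their `l_<timestamp>` name and mapped to the index files they reference. The class and method names are hypothetical; the two limits are passed in, with 100 and 10 being the values quoted in the comment.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of listing pruning followed by index file GC. */
final class ListingGcSketch {

    /**
     * Prunes the oldest listings and then removes index files that no remaining listing references.
     *
     * @param listings    listing timestamp -> index files referenced by that listing (modified in place)
     * @param indexFiles  index files currently present (modified in place)
     * @param maxListings number of listings to keep, 100 in the comment above
     * @param minBatch    minimum number of surplus listings before anything is deleted, 10 above
     * @return the index files that were garbage collected
     */
    static Set<String> gc(NavigableMap<Long, Set<String>> listings, Set<String> indexFiles,
                          int maxListings, int minBatch) {
        int surplus = listings.size() - maxListings;
        if (surplus < minBatch) {
            return Collections.emptySet(); // not enough surplus listings yet, do nothing
        }
        for (int i = 0; i < surplus; i++) {
            listings.pollFirstEntry(); // drop the oldest listing
        }
        Set<String> referenced = new HashSet<>();
        for (Set<String> files : listings.values()) {
            referenced.addAll(files);
        }
        Set<String> deleted = new TreeSet<>(indexFiles);
        deleted.removeAll(referenced); // only files no surviving listing refers to
        indexFiles.removeAll(deleted);
        return deleted;
    }
}
```

Batching the deletes this way keeps the check cheap on most opens: until at least `minBatch` surplus listings have accumulated, the method returns without touching anything.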
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986618#comment-14986618 ]
Chetan Mehrotra commented on OAK-3547:

bq. still need to do something to prune the listings and to delete files no longer referenced

That you can do in the AsyncIndexUpdate cycle itself, say after every 10 cycles/2 hrs.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985293#comment-14985293 ]
Ian Boston commented on OAK-3547:

The version just pushed works in AEM6.2 on MongoMK, with each save of the directory listing represented as a list of files with file size and SHA-1, as below:
{code}
02.11.2015 14:29:00.344 *INFO* [aysnc-index-update-fulltext-async] org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing Saving Listing state to l_1446474540217 as _6.si,229,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fdx,1143274,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.pos,2256835,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.cfs,1592,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6.cfe,224,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fdt,12261250,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.tip,156795,da39a3ee5e6b4b0d3255bfef95601890afd80709;segments_1x,154,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.si,232,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.tim,6207095,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.si,316,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.cfe,224,da39a3ee5e6b4b0d3255bfef95601890afd80709;segments.gen,20,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6.cfs,5403594,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fnm,61229,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.doc,2073360,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6_1.del,51,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b_1.del,38,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.nvd,142931,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.nvm,46,da39a3ee5e6b4b0d3255bfef95601890afd80709;
{code}
Since the save happens every few seconds, I still need to do something to prune the listings and to delete files no longer referenced. That is probably best done with code that runs every few hours.

I am slightly concerned at the frequency of close or sync operations performed on the OakDirectory.
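The saved listing in the log above is a run of `name,length,sha1;` entries. The following sketch shows one way such a string could be parsed; the class and method names are hypothetical and the patch may well do this differently. One observation that may be worth checking: every entry in that log carries the same digest, da39a3ee5e6b4b0d3255bfef95601890afd80709, which is the SHA-1 of zero bytes of input, so at that point the digest may not yet have covered the file content.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Oak: parses the "name,length,sha1;" format seen in the log. */
final class ListingFormatSketch {

    /** One file entry of a saved listing. */
    static final class Entry {
        final String name;
        final long length;
        final String sha1;

        Entry(String name, long length, String sha1) {
            this.name = name;
            this.length = length;
            this.sha1 = sha1;
        }
    }

    /** Splits a saved listing into its entries; empty segments (trailing ';') are ignored. */
    static List<Entry> parse(String listing) {
        List<Entry> entries = new ArrayList<>();
        for (String part : listing.split(";")) {
            String trimmed = part.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split(",");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Malformed listing entry: " + trimmed);
            }
            entries.add(new Entry(fields[0], Long.parseLong(fields[1]), fields[2]));
        }
        return entries;
    }

    /** Lower-case hex SHA-1 of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Calling `ListingFormatSketch.sha1Hex(new byte[0])` returns the digest that appears in every entry of that log.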
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982734#comment-14982734 ]
Ian Boston commented on OAK-3547:

The patch in the branch now passes the build's unit tests with the GenerationalDirectoryListing enabled.
[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors
[ https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971191#comment-14971191 ]
Ian Boston commented on OAK-3547:

https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547?expand=1

Currently the patch makes no change in behaviour, but puts the current behaviour behind an interface and provides 2 implementations: a SimpleDirectoryListing that uses the current implementation, and a GenerationalDirectoryListing that writes a new version of the node every time the listing is changed and checks the length and UUID of each file when the listing opens. A checksum is not implemented, as that looked too expensive to achieve given the blob structure.
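To illustrate the shape of that split, here is a rough in-memory sketch. It is not the interface from the patch: the type names, method names and the use of plain collections in place of Oak nodes are all assumptions, and the length/UUID check is stood in for by a caller-supplied predicate.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Hypothetical interface; the names are illustrative only. */
interface DirectoryListingSketch {
    Set<String> files();
    void add(String name);
    void save();
}

/** Overwrites a single stored listing in place, as the existing behaviour is described. */
final class SimpleListingSketch implements DirectoryListingSketch {
    private final Set<String> stored; // stands in for the one persisted listing
    private final Set<String> current;

    SimpleListingSketch(Set<String> stored) {
        this.stored = stored;
        this.current = new TreeSet<>(stored);
    }

    public Set<String> files() { return Collections.unmodifiableSet(current); }
    public void add(String name) { current.add(name); }
    public void save() { stored.clear(); stored.addAll(current); }
}

/** Writes a new generation on every save and opens the newest generation that validates. */
final class GenerationalListingSketch implements DirectoryListingSketch {
    private final NavigableMap<Long, Set<String>> generations; // stands in for versioned nodes
    private final Set<String> current = new TreeSet<>();
    private long next;

    GenerationalListingSketch(NavigableMap<Long, Set<String>> generations,
                              Predicate<Set<String>> isValid) {
        this.generations = generations;
        // Walk from newest to oldest and use the first listing that passes validation.
        for (Set<String> candidate : generations.descendingMap().values()) {
            if (isValid.test(candidate)) {
                current.addAll(candidate);
                break;
            }
        }
        this.next = generations.isEmpty() ? 1L : generations.lastKey() + 1L;
    }

    public Set<String> files() { return Collections.unmodifiableSet(current); }
    public void add(String name) { current.add(name); }
    public void save() { generations.put(next++, new TreeSet<>(current)); } // never overwrites
}
```

Because older generations are never overwritten, a listing that fails validation on open can be skipped in favour of an earlier one, which is the recovery behaviour the issue is aiming for.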