[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-10-17 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581731#comment-15581731
 ] 

Ian Boston commented on OAK-3547:
-

Using Lucene directly to manage generations of the segments file is covered in 
OAK-4943

> Improve ability of the OakDirectory to recover from unexpected file errors
> --
>
> Key: OAK-3547
> URL: https://issues.apache.org/jira/browse/OAK-3547
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.4
>Reporter: Ian Boston
>
> Currently if the OakDirectory finds that a file is missing or in some way 
> damaged, an exception is thrown which impacts all queries using that index, 
> at times making the index unavailable. This improvement aims to make the 
> OakDirectory recover to a previously ok state by storing which files were 
> involved in previous states, and giving the code some way of checking if they 
> are valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-10-06 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551486#comment-15551486
 ] 

Ian Boston commented on OAK-3547:
-

The original patch was written such a long time ago that I don't think IndexCopier 
was present, or at least the deployment the patch was targeting did not have 
copyOnWrite etc. enabled or possibly available. The impl of OakIndexFile is 
suboptimal for Lucene usage, as it loads chunks of the index into memory as 
byte[] to perform seek, whereas FSDirectory uses OS-level native code to seek, 
hence it makes no sense to use OakDirectory any more. FSDirectory should be 
used by whatever means necessary. It might be an idea to delete or deprecate 
OakDirectory, so it's not used for opening Lucene indexes.

The patch is in a state where it should not be applied or used. It can't 
efficiently determine corruption without direct access to the underlying file, 
which is abstracted by Oak.

With the benefit of hindsight, the patch should be in IndexCopier to prevent a 
bad segments.gen file failing the index.

We should close this issue as the patch isn't valid any more.


> Improve ability of the OakDirectory to recover from unexpected file errors
> --
>
> Key: OAK-3547
> URL: https://issues.apache.org/jira/browse/OAK-3547
> Project: Jackrabbit Oak
>  Issue Type: Improvement
>  Components: lucene
>Affects Versions: 1.4
>Reporter: Ian Boston
> Fix For: 1.6
>
>
> Currently if the OakDirectory finds that a file is missing or in some way 
> damaged, an exception is thrown which impacts all queries using that index, 
> at times making the index unavailable. This improvement aims to make the 
> OakDirectory recover to a previously ok state by storing which files were 
> involved in previous states, and giving the code some way of checking if they 
> are valid.





[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-10-06 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551315#comment-15551315
 ] 

Chetan Mehrotra commented on OAK-3547:
--

List of NodeState using generational approach [1]

{noformat}
Base state - No file added
===
  /{saveDirectoryListing = true}
:data
  :dir
l_1475740652371{state = }
l_1475740652360{state = }
  :data
===

3 files added
===
  /{saveDirectoryListing = true}
:data
  :dir
l_1475740652371{state = }
l_1475740652394{state = 
foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;}
l_1475740652360{state = }
  :data
  foo2{blobSize = 1047552, jcr:lastModified = 1475740652394, jcr:data = [1 
binaries], uniqueKey = 84dba956e1634dc14812db9a7ae6a81c}
  foo1{blobSize = 1047552, jcr:lastModified = 1475740652393, jcr:data = [1 
binaries], uniqueKey = a9505647e35a7f0e0365adbc2531c91f}
  foo0{blobSize = 1047552, jcr:lastModified = 1475740652392, jcr:data = [1 
binaries], uniqueKey = 937a0057a90df312f452aceb96f40c8c}
===


3 'bar' files added
===
  /{saveDirectoryListing = true}
:data
  :dir
l_1475740652371{state = }
l_1475740652407{state = 
foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;bar2,bar2,656,2ea3ee49550e65f45d2e8d706e1a3bcef2d4a8b3;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;bar1,bar1,953,a296f858ccb93414d6e99ca7a1997fb711faa65d;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;bar0,bar0,433,83ad0b2e68f114aa901c9eee0715c4507cd5dfe1;}
l_1475740652394{state = 
foo2,foo2,612,5b41ac56a8454e48af2dd7aa740f71a96a740366;foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;}
l_1475740652360{state = }
  :data
  foo2{blobSize = 1047552, jcr:lastModified = 1475740652394, jcr:data = [1 
binaries], uniqueKey = 84dba956e1634dc14812db9a7ae6a81c}
  foo1{blobSize = 1047552, jcr:lastModified = 1475740652393, jcr:data = [1 
binaries], uniqueKey = a9505647e35a7f0e0365adbc2531c91f}
  foo0{blobSize = 1047552, jcr:lastModified = 1475740652392, jcr:data = [1 
binaries], uniqueKey = 937a0057a90df312f452aceb96f40c8c}
  bar1{blobSize = 1047552, jcr:lastModified = 1475740652406, jcr:data = [1 
binaries], uniqueKey = 5ff21364ce67199cddd14833f0614d73}
  bar2{blobSize = 1047552, jcr:lastModified = 1475740652406, jcr:data = [1 
binaries], uniqueKey = c10e5f46e90e01d927f0aee82980ac6d}
  bar0{blobSize = 1047552, jcr:lastModified = 1475740652405, jcr:data = [1 
binaries], uniqueKey = 0f32d06997a4fec9a90060b863cda454}
===

{noformat}

* Empty {{:data}} created under {{:data}}. Is this required, or is it possibly due to the bug 
below? In line #2 the same builder should be used.
{code}
private SimpleDirectoryListing(@Nonnull IndexDefinition definition,
                               @Nonnull NodeBuilder builder) {
    this.definition = definition;
    this.directoryBuilder = getOrCreateChild(builder, INDEX_DATA_CHILD_NAME);
    this.fileNames.addAll(getListing());
}
{code}
* load and save are called every time, even if nothing has changed. This adds empty 
l_ddd nodes and should be avoided.
* LISTING_STATE_PROPERTY - 
** It holds the encoded listing info, stored as a single string property. Hopefully 
this does not grow very large if the directory content is large. Not sure of 
typical sizes.
** You could possibly use a multi-valued property here. 
* Not sure about the snippet below in the {{load}} method.
{code}
if (loaded >= 0) {
    return (loaded != childNodes.size() - 1);
}
{code}
* {{doGC}} and {{sync}} methods do not have any test coverage
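
To make the LISTING_STATE_PROPERTY point concrete, here is a rough sketch of decoding the single-string form seen in the dump above (semicolon-separated records of four comma-separated fields). The class name, the field names, and the reading of the first two fields as two names are assumptions made for illustration; none of this is taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoder for the encoded listing string; not the patch's code. */
final class ListingStateParser {

    /** One record of the listing. The field names are assumptions. */
    static final class Entry {
        final String name;        // first field in the record
        final String secondName;  // second field; its meaning is not stated in the thread
        final long length;        // third field, read as the recorded file length
        final String checksum;    // fourth field, read as a hex checksum

        Entry(String name, String secondName, long length, String checksum) {
            this.name = name;
            this.secondName = secondName;
            this.length = length;
            this.checksum = checksum;
        }
    }

    /** Splits "a,a,1,ff;b,b,2,ee;" into entries; an empty or null state yields no entries. */
    static List<Entry> parse(String state) {
        List<Entry> entries = new ArrayList<>();
        if (state == null || state.isEmpty()) {
            return entries;
        }
        for (String record : state.split(";")) {
            if (record.isEmpty()) {
                continue;
            }
            String[] fields = record.split(",");
            if (fields.length != 4) {
                throw new IllegalArgumentException("Malformed listing entry: " + record);
            }
            entries.add(new Entry(fields[0], fields[1], Long.parseLong(fields[2]), fields[3]));
        }
        return entries;
    }

    public static void main(String[] args) {
        String state = "foo1,foo1,233,ff0e1a49171201d638848c3f3e9b003596a81d6b;"
                + "foo0,foo0,73,7feabefe2276000fa7722d6536183a508816d2c7;";
        for (Entry e : parse(state)) {
            System.out.println(e.name + " " + e.length + " " + e.checksum);
        }
    }
}
```

With a multi-valued property each record would be one value, so the split on ";" would go away and only the per-record split on "," would remain.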


*Feature Flag*
It would be more comforting if this feature were driven by a feature flag, so that the 
generation logic is used if enabled; otherwise it defaults to 
{{GenerationalDirectoryListing}}. We can expose a setting in 
{{LuceneIndexProviderService}} to lock the new feature so as to enable controlled 
testing. 

I think having a flag to enable it when it is not yet in use would be easy. What 
would be tricky is disabling it once enabled ... that aspect can 
possibly be ignored.

h5. Effect of corruption on writes

Currently, if corruption occurs the system automatically falls back to an older 
version. This is fine for reads, but for writes it would mean data loss unless 
the index is reindexed, as the async indexer would only index newer content. We have 2 options here:

# Let the async indexer continue but provide some indication that the index is corrupt 
and a reindex is required at some point. This needs to be highlighted in a 
prominent way (periodic logs, JMX etc.)
# Let the fallback be used for reads (readOnly == true) but let it fail for writes

Possibly this needs to be exposed as configuration.

[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-09-28 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529963#comment-15529963
 ] 

Ian Boston commented on OAK-3547:
-

Some inspection of all the metrics captured indicates that the method [1] is 
the main cause of differences in times taken to open, sync and close the 
directory index, as each operation must generate a fresh SHA1 from all the 
index files. 

If it was possible to rely on some other mechanism for checking the integrity 
of each file, this checksum could be replaced with something much simpler, like 
file length which would avoid generating a sha1 on each operation. This would 
then rely on Oak Lucene managing the recovery rather than the Oak Directory 
listing being self healing. To achieve this, a drop() method on the 
OakDirectory might be required to drop the current generation of the listing on 
demand.
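
To illustrate the trade-off, here is a small sketch contrasting the two kinds of check. It is not the code from the branch: the class and method names are made up, and it works on plain streams rather than Oak blobs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helpers contrasting a full SHA-1 check with a length-only check. */
final class IntegrityChecks {

    /** Full check: reads every byte of the stream and returns the SHA-1 as lowercase hex. */
    static String sha1Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Cheap check: compares the recorded length with the actual one, reading no content. */
    static boolean lengthMatches(long recordedLength, long actualLength) {
        return recordedLength == actualLength;
    }

    public static void main(String[] args) {
        byte[] original = "index-data".getBytes(StandardCharsets.UTF_8);
        byte[] damaged = "index-dat4".getBytes(StandardCharsets.UTF_8);
        System.out.println("length check passes: "
                + lengthMatches(original.length, damaged.length));
        System.out.println("sha1 check passes:   "
                + sha1Hex(new ByteArrayInputStream(original))
                        .equals(sha1Hex(new ByteArrayInputStream(damaged))));
    }
}
```

The length check costs nothing per open but accepts a same-length file with altered content, which the SHA-1 comparison rejects. That is the robustness being traded for speed.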


1 
https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547#diff-28ec89220db72ab858b9eb25927c2a29R1026



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-09-28 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15529581#comment-15529581
 ] 

Ian Boston commented on OAK-3547:
-

I have tested the patch in Sling and AEM 6.2+Oak 1.6; no functional regressions 
or errors were seen.
There is an indication that opening the index is significantly slower with the 
patch because, in order to verify all the files in the index, the files are read and 
a SHA1 is generated from them to ensure that they are not damaged. 
Instrumenting the OakDirectory constructor call, which in the generational 
version validates the contents of the directory before opening, reveals the following.

Without OAK-3547 patch
{code}
t           count  max        mean      min       stddev    p50       p75       p95       p98      p99      p999       mean_rate  m1_rate  m5_rate   m15_rate  rate_unit     duration_unit
1475066389  40     10.221184  0.148839  0.022663  0.882145  0.072211  0.072211  0.078419  0.15831  0.15831  10.221184  0.07476    0.00111  0.227486  0.687796  calls/second  milliseconds
{code}

With OAK-3547 patch
{code}
t           count  max         mean       min       stddev      p50         p75         p95         p98         p99         p999        mean_rate  m1_rate   m5_rate   m15_rate  rate_unit     duration_unit
1475063657  40     571.811075  378.76475  0.125733  209.479519  492.429012  492.429012  492.429012  571.811075  571.811075  571.811075  0.082329   0.002645  0.269424  0.727275  calls/second  milliseconds
{code}

How large the difference is will depend on the size of the index files. The 
patch may also move the IO read operation on the index from outside the 
OakDirectory constructor to inside it, so these 
readings may or may not be significant.

If they prove to be significant, then the SHA1 check on files could be dropped from 
every directory open and some other check performed instead. Other checks won't 
be as robust as a full SHA1 check.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-09-23 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516665#comment-15516665
 ] 

Ian Boston commented on OAK-3547:
-

There is now unit test coverage to validate that the correct previous 
generation of the directory will be opened in the event the underlying files 
are lost or damaged.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-09-22 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513803#comment-15513803
 ] 

Ian Boston commented on OAK-3547:
-

Build passes up to LDAP

{code}
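[INFO] Oak HTTP Binding .................................. SUCCESS [  3.334 s]
[INFO] Oak Lucene ........................................ SUCCESS [04:26 min]
[INFO] Oak Solr core ..................................... SUCCESS [ 53.040 s]
[INFO] Oak Solr OSGi ..................................... SUCCESS [ 44.762 s]
[INFO] Oak External Authentication Support ............... SUCCESS [ 58.348 s]
[INFO] Oak LDAP Authentication Support ................... FAILURE [27:47 min]
[INFO] Oak TarMK Standby ................................. SKIPPED
[INFO] Oak Remote API .................................... SKIPPED
[INFO] Oak CUG Authorization ............................. SKIPPED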
{code}



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2016-09-22 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513619#comment-15513619
 ] 

Ian Boston commented on OAK-3547:
-

The branch at [1] has been updated to work with trunk at r1761930 and passes 
all unit tests in the bundle build. Doing a full build now to verify there are no 
regressions.

1 https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547

cc: [~chetanm]




[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-06 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993557#comment-14993557
 ] 

Ian Boston commented on OAK-3547:
-

With the latest commit to the branch, segments.gen is now stored in 
segments.gen_ and is immutable, making it possible to correctly open 
previous generations.

I think the patch is complete, unless a more sophisticated recovery mechanism 
is required (i.e. flagging the index as requiring a rebuild, without actually 
doing it).

This obviously needs extensive testing to see what happens when real repo corruption 
happens in a live cluster. At present it has only been tested with a single instance on 
MongoMK.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-05 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991734#comment-14991734
 ] 

Ian Boston commented on OAK-3547:
-

segments.gen is opened by Lucene as well as segments_xx, which means that it's 
mutable and used. To allow previous generations to be used, the name will need 
to be transformed inside the listing, if that is possible.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-05 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991349#comment-14991349
 ] 

Ian Boston commented on OAK-3547:
-

[~mreutegg] If an earlier version of the index is used by the writer, there 
will be holes in the index and items will be missing. There are several 
options: a) flag the issue to alert admins that the index is not healthy, but 
continue to index using an index that will open; b) fail the index write and 
stop indexing completely; c) fail the index write and start re-indexing 
automatically. Of those I think option a will deliver the best continuity. 
Option b risks wide-scale application-level issues; option c risks both 
application-level issues and potential unavailability caused by the load of 
rebuilding an index from scratch. There is no easy answer. 

Now that there are checksums in place I have been seeing more frequent race 
conditions between the writer and the readers, which occasionally open older 
versions. I think this is because the OakDirectory checks all the files when 
it's opened by computing a checksum of everything referenced. I think that 
Lucene delays checking the file or checking the internals of a file until it's 
needed, hence any errors are more visible than before.



Lucene already has a concept of committing the index by syncing the segments_xx 
and segments.gen files. I am writing the listing node on sync of either of these 
or on close of the index, which has reduced the number of generations. The result 
appears to be very stable. I have also introduced the concept of mutability, as 
some of the file types are mutable. .del is mutable, so the length and checksum 
are not checked. If a .del from a later generation is used, that will only 
delete the Lucene docs that were deleted in that later generation. No damage. 
segments.gen is also mutable. This is more of a problem. It is supposed to be a 
fallback file with segments_xx used in preference; however, if segments.gen is 
used it will be from the wrong generation and will define the wrong set of 
segment files for the index. I need to check if segments.gen is ever read. If it 
is, then I think the OakDirectory needs to map segments.gen to a generational 
version of the same (i.e. segments.gen_) so that only .del files are 
mutable. That should make the OakDirectory recoverable.
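
As a rough illustration of the mutability rule described above, and not the code from the branch, a validator might skip the length and checksum comparison for mutable files along these lines. The class and method names are made up, and deciding mutability purely by file name is an assumption.

```java
/** Hypothetical sketch of mutability-aware validation; names are illustrative only. */
final class MutableAwareValidator {

    /**
     * Assumption: mutability is decided by file name alone. Both .del files and
     * segments.gen are treated as mutable, matching the state described in the comment.
     */
    static boolean isMutable(String fileName) {
        return fileName.endsWith(".del") || fileName.equals("segments.gen");
    }

    /**
     * Returns true if the file is acceptable for the recorded listing entry.
     * Mutable files are accepted without comparing length or checksum.
     */
    static boolean matchesRecorded(String fileName,
                                   long recordedLength, String recordedSha1,
                                   long actualLength, String actualSha1) {
        if (isMutable(fileName)) {
            return true;
        }
        return recordedLength == actualLength && recordedSha1.equals(actualSha1);
    }

    public static void main(String[] args) {
        // A .del file from a later generation is accepted even though it differs.
        System.out.println(matchesRecorded("_0_1.del", 36, "aaa", 40, "bbb"));
        // An immutable file with a different checksum is rejected.
        System.out.println(matchesRecorded("_0.cfs", 10, "aaa", 10, "bbb"));
    }
}
```

If segments.gen were mapped to a generational name as proposed, the second clause of isMutable would be dropped and only .del files would bypass the checks.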








[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989655#comment-14989655
 ] 

Chetan Mehrotra commented on OAK-3547:
--

I think this approach should only affect the flow on the query side. If the index 
gets corrupted we should let AsyncIndexUpdate fail. The purpose of this feature 
is to avoid immediate downtime.

Another option would be to mark that index as kind of disabled, so that it does 
not block the indexing cycle, and set its reindex flag to true. Then either 
it gets automatically reindexed in the next cycle, or we expose some JMX operation so 
that an admin can determine when the reindexing is performed.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989642#comment-14989642
 ] 

Marcel Reutegger commented on OAK-3547:
---

I was referring to {{NodeStore.checkpoint()}} used by the {{AsyncIndexUpdate}}. 
The index update uses those checkpoints to determine changes that need to be 
indexed. Right now the checkpoint is released after the lucene index was 
updated. If we revert back to an earlier version of the lucene index don't we 
miss changes because the next index update will be based on the current 
checkpoint?



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989600#comment-14989600
 ] 

Ian Boston commented on OAK-3547:
-

[~mreutegg] Currently every call to OakDirectory.sync(...) and 
OakDirectory.close(...) where the OakDirectory is not a read-only Oak 
directory causes a list of files with size and SHA1 hash to be written to a 
new node with a name of the form /oak:index//:state/l_. 
When the current Oak session commits, that is committed to the Oak repo. When 
the OakDirectory is loaded, it tries up to 100 l_ nodes in order, newest 
first, checking that the contents are present and have matching length+sha1. 
The first valid listing found is loaded. If no valid matches are found then the 
code reverts to the earlier behaviour, using all the non-deleted files in the 
/oak:index//:dir folder. If the bundle is deployed to an existing 
repository it will fall back to the old behaviour.
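
The load procedure described above can be sketched roughly as follows. This is a simplified illustration rather than the code in the branch: the class and method names are invented, validation is passed in as a predicate, and the suffix after l_ is assumed to be a numeric timestamp that orders the listings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical sketch of picking the newest valid listing; not the patch's code. */
final class ListingSelector {

    /** Upper bound on how many listings are tried, as stated in the comment. */
    static final int MAX_ATTEMPTS = 100;

    /** Assumption: a listing node is named "l_" followed by a numeric timestamp. */
    static long timestampOf(String listingName) {
        return Long.parseLong(listingName.substring("l_".length()));
    }

    /**
     * Tries listings newest first and returns the first one the validator accepts.
     * An empty result means the caller should fall back to the old behaviour.
     */
    static Optional<String> selectListing(List<String> listingNames, Predicate<String> isValid) {
        List<String> newestFirst = new ArrayList<>(listingNames);
        newestFirst.sort((a, b) -> Long.compare(timestampOf(b), timestampOf(a)));
        int attempts = 0;
        for (String name : newestFirst) {
            if (attempts++ >= MAX_ATTEMPTS) {
                break;
            }
            if (isValid.test(name)) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("l_1446631315966", "l_1446631317105");
        // Pretend the newest listing fails validation, so the older one is chosen.
        Optional<String> chosen = selectListing(names, n -> !n.equals("l_1446631317105"));
        System.out.println(chosen.orElse("fall back to old behaviour"));
    }
}
```

In the real directory the predicate would be the presence plus length+sha1 check, which is where the cost of opening comes from.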

I have assumed a call to either OakDirectory.sync(...) or 
OakDirectory.close(...)  indicates a checkpoint of the Lucene indexing process.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Marcel Reutegger (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989491#comment-14989491
 ] 

Marcel Reutegger commented on OAK-3547:
---

IIUC the lucene index will fall back to a previous version in case it faces an 
inconsistent state. Is there also some coordination with the checkpoint 
associated with the previous known good state of the lucene index?



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989329#comment-14989329
 ] 

Ian Boston commented on OAK-3547:
-

Attempts to break the index operation have produced recovery behaviour that 
appears stable.

{code}
04.11.2015 10:02:20.391 *INFO* [aysnc-index-update-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Saving Listing state to l_1446631340351 as 
_3.si,229,1a1c2c541a3cb5087d2a3c60ae77b6b05461410d;segments_3,117,a94bece6be54b2690a3b11925ed9bbc80e914d5d;segments_4,117,a74de5ae54af2abbb2423573cd0ad93b950ec18a;_2.cfe,224,dd2d758773e57e172933ff3d3fc3a4908af59dc4;_0_1.del,36,906f2506ff277e28716cc19eb8b55f289e34c53c;_0.si,229,57e0616d14993ad2a9680f55d4151d440cad8255;_2.si,229,ceddd18aa9343666e78f6330d0e261e96474717b;segments.gen,20,395c2b9ba7f05f4debb52b0a7cea8ac56ad671a2;_3.cfe,224,2596860e7bcdd550e221488708afda8729689107;_0.cfs,1431868,a7368bb6e2398a5952eddbc062a498f100a29865;_0.cfe,266,62127a60fe2224e32e3720eb15b2bd9f34d4670a;_3.cfs,1210,1c7764b0713716c53d1b0ed21c0063c5606aad49;_2.cfs,1188,1cd8b738619350e6998b77c2142d5a3748e861f1;
 
04.11.2015 10:02:20.391 *INFO* [aysnc-index-update-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Saving due to close.
04.11.2015 10:02:20.427 *INFO* [aysnc-index-update-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Saving Listing state to l_1446631340391 as 
_3.si,229,1a1c2c541a3cb5087d2a3c60ae77b6b05461410d;segments_4,117,a74de5ae54af2abbb2423573cd0ad93b950ec18a;_0_1.del,36,906f2506ff277e28716cc19eb8b55f289e34c53c;_0.si,229,57e0616d14993ad2a9680f55d4151d440cad8255;segments.gen,20,395c2b9ba7f05f4debb52b0a7cea8ac56ad671a2;_3.cfe,224,2596860e7bcdd550e221488708afda8729689107;_0.cfs,1431868,a7368bb6e2398a5952eddbc062a498f100a29865;_0.cfe,266,62127a60fe2224e32e3720eb15b2bd9f34d4670a;_3.cfs,1210,1c7764b0713716c53d1b0ed21c0063c5606aad49;
 
04.11.2015 10:02:23.375 *WARN* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 IO Exception reading index file 
at 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.getIndexFileMetaData(OakDirectory.java:965)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.validateListing(OakDirectory.java:844)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.load(OakDirectory.java:878)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.<init>(OakDirectory.java:750)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing.<init>(OakDirectory.java:728)
04.11.2015 10:02:23.377 *WARN* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Index File and Oak Version and not the same Name: _xi.fdt, length:14816174, 
checksum:ff821a1bde2330f1389c782a47677206c685  CheckSum: 
ff821a1bde2330f1389c782a47677206c685 != Unable to generate checksum 
java.lang.RuntimeException: failed to read block from backend, id 
b535214bddc090c74a426acaeeb5654140c1be52d4af824f2b759113c8a7bdc6@0,
04.11.2015 10:02:23.377 *WARN* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Rejected directory listing l_1446631317105 
04.11.2015 10:02:24.104 *INFO* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Accepted directory listing l_1446631315966, using 
04.11.2015 10:02:24.129 *INFO* [oak-lucene-0] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Accepted directory listing l_1446631317105, using 
04.11.2015 10:02:25.034 *INFO* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Saving Listing state to l_1446631344105 as 
_xi.tim,7775213,f8e401ff1a95bba1c387a7d00239a9b10c4323ba;_xi.si,326,fcb94171f92573a5fd365b178dda96728c16da50;_xi_3.del,38,703e0294067b659a0f15f8114659b20de3d51385;_1oa.si,235,ddb8cb303fdf4372cc333a980a1755d25d64d6cb;_xi.pos,2748906,154e1ba4078865bc81f35646e36dea906e54b539;_xi.nvd,159060,abacf7b5c17963a24c9f54715ac40c1e2dd85f0b;_xi.fdt,14816174,ff821a1bde2330f1389c782a47677206c685;_xj.cfe,224,1403e26ddc7e3d005758f3dae6c8bbf50e9a4313;_1o9.cfe,224,f7926636965fdd76c2622d57f3e24a217f230a44;_xi.nvm,46,d09f4ec10424aac4b5a2fe1da422f266aace8bca;_xj.si,232,f119a8cab06c10a0e34d87b92e16bf1b28a688f3;_xi.fdx,1272306,0c62a780b2f3c62af3014bc530d79d8144a8f014;segments.gen,20,7b46fe18999a02b4247cbcd8222034d4a2c9291c;segments_1o6,157,21acb7510ff37ef18452d622d8fcf2

[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-04 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989147#comment-14989147
 ] 

Ian Boston commented on OAK-3547:
-

No issues seen after running for 12 hours.
Recovery capabilities still need improving. Currently, any change to any file 
referenced in a listing causes the listing to be rejected. Some Lucene index 
files are mutable (the delete file), so checking must be relaxed for those 
files because they will change. I need to analyse the changes in the log files 
to check what really is mutable and what is immutable in a listing.
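
A minimal sketch of the relaxed check described above, assuming that only 
files ending in .del are mutable (the analysis of what is really mutable is 
still outstanding). RelaxedListingCheck, FileMeta and isMutable are 
hypothetical names for illustration and are not part of OakDirectory.

```java
import java.util.Map;

/** Hypothetical sketch, not the OakDirectory implementation. */
final class RelaxedListingCheck {

    /** Length and checksum recorded for one index file when the listing was saved. */
    static final class FileMeta {
        final long length;
        final String sha1;

        FileMeta(long length, String sha1) {
            this.length = length;
            this.sha1 = sha1;
        }
    }

    /** Assumption: only Lucene delete files (*.del) change after being written. */
    static boolean isMutable(String fileName) {
        return fileName.endsWith(".del");
    }

    /**
     * Accepts the listing if every recorded file is still present and every
     * immutable file still has the recorded length and checksum.
     * Mutable files only need to exist.
     */
    static boolean accept(Map<String, FileMeta> recorded, Map<String, FileMeta> current) {
        for (Map.Entry<String, FileMeta> e : recorded.entrySet()) {
            FileMeta now = current.get(e.getKey());
            if (now == null) {
                return false; // a missing file always rejects the listing
            }
            if (isMutable(e.getKey())) {
                continue; // skip the length and checksum comparison
            }
            FileMeta then = e.getValue();
            if (then.length != now.length || !then.sha1.equals(now.sha1)) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a changed _0_1.del no longer rejects a listing, while a 
changed _0.cfs still does.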



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-03 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987582#comment-14987582
 ] 

Ian Boston commented on OAK-3547:
-

Added GC for index files and listing files. At present the code keeps a maximum 
of 100 directory listing files and won't delete any index files referenced 
in those directory listing files. When there are 10 or more directory listing 
files to delete, they are deleted and the index files are GC'd. The number of 
directory listing files can be changed.
The check happens every time a non-read-only OakDirectory is opened. It is 
quick to perform the check.
Code just pushed; testing with AEM6.1 on MongoMK overnight.
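
The GC rule above might be sketched roughly as follows. This is an in-memory 
illustration, not the pushed code: ListingGc and collect are hypothetical 
names, listings are modelled as a map from listing name to the index files it 
references, and the sketch assumes that listing names of the form 
l_<timestamp> sort oldest first.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of listing pruning followed by index file GC. */
final class ListingGc {

    /**
     * Prunes the oldest listings once at least minBatch of them exceed
     * maxListings, then removes index files no remaining listing references.
     *
     * @param listings   listing name to referenced index files, oldest name first
     * @param indexFiles all index files currently stored; modified in place
     * @return the index files that were removed
     */
    static List<String> collect(TreeMap<String, Set<String>> listings,
                                Set<String> indexFiles,
                                int maxListings,
                                int minBatch) {
        List<String> removed = new ArrayList<>();
        int surplus = listings.size() - maxListings;
        if (surplus < minBatch) {
            return removed; // not enough listings to delete yet
        }
        for (int i = 0; i < surplus; i++) {
            listings.pollFirstEntry(); // drop the oldest listing
        }
        // Everything still referenced by a surviving listing must be kept.
        Set<String> referenced = new HashSet<>();
        for (Set<String> files : listings.values()) {
            referenced.addAll(files);
        }
        for (Iterator<String> it = indexFiles.iterator(); it.hasNext();) {
            String file = it.next();
            if (!referenced.contains(file)) {
                it.remove();
                removed.add(file);
            }
        }
        return removed;
    }
}
```

With the values from the comment, the call would be 
collect(listings, indexFiles, 100, 10).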



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-02 Thread Chetan Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986618#comment-14986618
 ] 

Chetan Mehrotra commented on OAK-3547:
--

bq. still need to do something to prune the listings and to delete files no 
longer referenced

You can do that in the AsyncIndexUpdate cycle itself, say after every 10 cycles 
or 2 hrs.
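
One way to read this suggestion, as a sketch only: count indexing cycles and 
trigger the cleanup when either a cycle limit or a time limit is reached. 
CleanupSchedule and cycleFinished are hypothetical names and are not part of 
AsyncIndexUpdate; treating "10 cycles or 2 hrs" as "whichever comes first" is 
an assumption.

```java
/** Hypothetical helper deciding when a periodic cleanup is due. */
final class CleanupSchedule {
    private final int maxCycles;
    private final long maxIntervalMillis;
    private int cyclesSinceCleanup;
    private long lastCleanupMillis;

    CleanupSchedule(int maxCycles, long maxIntervalMillis, long nowMillis) {
        this.maxCycles = maxCycles;
        this.maxIntervalMillis = maxIntervalMillis;
        this.lastCleanupMillis = nowMillis;
    }

    /**
     * Call once at the end of each indexing cycle.
     * Returns true when the caller should run the cleanup now, and in that
     * case resets both the cycle counter and the timer.
     */
    boolean cycleFinished(long nowMillis) {
        cyclesSinceCleanup++;
        boolean due = cyclesSinceCleanup >= maxCycles
                || nowMillis - lastCleanupMillis >= maxIntervalMillis;
        if (due) {
            cyclesSinceCleanup = 0;
            lastCleanupMillis = nowMillis;
        }
        return due;
    }
}
```

For the figures mentioned above this would be constructed as 
new CleanupSchedule(10, 2L * 60 * 60 * 1000, System.currentTimeMillis()).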



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-11-02 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985293#comment-14985293
 ] 

Ian Boston commented on OAK-3547:
-

The version just pushed works in AEM6.2 on MongoMK. Each save of the directory 
listing is represented as a list of files, each with its file size and SHA-1, 
as below:

{code}
02.11.2015 14:29:00.344 *INFO* [aysnc-index-update-fulltext-async] 
org.apache.jackrabbit.oak.plugins.index.lucene.OakDirectory$GenerationalDirectoryListing
 Saving Listing state to l_1446474540217 as 
_6.si,229,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fdx,1143274,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.pos,2256835,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.cfs,1592,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6.cfe,224,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fdt,12261250,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.tip,156795,da39a3ee5e6b4b0d3255bfef95601890afd80709;segments_1x,154,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.si,232,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.tim,6207095,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.si,316,da39a3ee5e6b4b0d3255bfef95601890afd80709;_23.cfe,224,da39a3ee5e6b4b0d3255bfef95601890afd80709;segments.gen,20,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6.cfs,5403594,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.fnm,61229,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.doc,2073360,da39a3ee5e6b4b0d3255bfef95601890afd80709;_6_1.del,51,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b_1.del,38,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.nvd,142931,da39a3ee5e6b4b0d3255bfef95601890afd80709;_b.nvm,46,da39a3ee5e6b4b0d3255bfef95601890afd80709;
 
{code}
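
The saved state in the log above is a sequence of name,length,sha1 records, 
each terminated by a semicolon. A small parser for that text form might look 
like the sketch below. ListingStateParser and Entry are hypothetical names; 
how OakDirectory itself reads the state back is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the "name,length,sha1;" form shown in the log. */
final class ListingStateParser {

    /** Length and SHA-1 recorded for one index file. */
    static final class Entry {
        final long length;
        final String sha1;

        Entry(long length, String sha1) {
            this.length = length;
            this.sha1 = sha1;
        }
    }

    /** Parses the listing state into a map keyed by file name, keeping record order. */
    static Map<String, Entry> parse(String state) {
        Map<String, Entry> result = new LinkedHashMap<>();
        for (String record : state.split(";")) {
            String trimmed = record.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank records and surrounding whitespace
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Malformed record: " + trimmed);
            }
            result.put(parts[0], new Entry(Long.parseLong(parts[1]), parts[2]));
        }
        return result;
    }
}
```

Note that da39a3ee5e6b4b0d3255bfef95601890afd80709 is the SHA-1 digest of 
empty input, so the identical digests in the log above suggest the checksum 
may have been computed over no bytes in that run.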

Since a save happens every few seconds, I still need to do something to prune 
the listings and to delete files that are no longer referenced. This is 
probably best done with code that runs every few hours. 

I am slightly concerned about the frequency of close or sync operations 
performed on the OakDirectory.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-10-30 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982734#comment-14982734
 ] 

Ian Boston commented on OAK-3547:
-

Patch in branch now passes build unit tests with the 
GenerationalDirectoryListing enabled.



[jira] [Commented] (OAK-3547) Improve ability of the OakDirectory to recover from unexpected file errors

2015-10-23 Thread Ian Boston (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971191#comment-14971191
 ] 

Ian Boston commented on OAK-3547:
-


https://github.com/apache/jackrabbit-oak/compare/trunk...ieb:OAK-3547?expand=1

Currently the patch makes no change, but puts the current behaviour behind an 
interface and provides 2 implementations: a SimpleDirectoryListing that uses 
the current implementation, and a GenerationalDirectoryListing that writes a 
new version of the node every time the listing is changed and checks the 
length and UUID of each file when the listing opens. A checksum is not 
implemented, as that looked too expensive to achieve given the blob structure.
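
The shape of that split can be illustrated with an in-memory sketch. The 
interface and class names below (ListingSketch, InMemorySimpleListing, 
InMemoryGenerationalListing) and their methods are hypothetical stand-ins. 
The real classes in the patch store their state in Oak nodes and may expose a 
different API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical interface hiding how the directory listing is stored. */
interface ListingSketch {
    Set<String> files();

    void save(Set<String> files);
}

/** Keeps a single state that is overwritten on every save. */
final class InMemorySimpleListing implements ListingSketch {
    private Set<String> current = new HashSet<>();

    public Set<String> files() {
        return new HashSet<>(current);
    }

    public void save(Set<String> files) {
        current = new HashSet<>(files);
    }
}

/** Writes a new generation on every save so an older one can be used later. */
final class InMemoryGenerationalListing implements ListingSketch {
    private final Deque<Set<String>> generations = new ArrayDeque<>();

    public Set<String> files() {
        return generations.isEmpty()
                ? new HashSet<String>()
                : new HashSet<>(generations.peek());
    }

    public void save(Set<String> files) {
        generations.push(new HashSet<>(files)); // newest generation first
    }

    /** Drops the newest generation; returns true if an older one remains. */
    boolean rollback() {
        if (!generations.isEmpty()) {
            generations.pop();
        }
        return !generations.isEmpty();
    }
}
```

The point of the generational variant is that a listing which fails 
validation can be discarded in favour of the previous one, which the 
single-state variant cannot do.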
