[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-12-05 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989113#comment-16989113
 ] 

David Smiley commented on SOLR-13872:
-

This is a really impressive investigation, Hoss; I am in awe!

>  Backup can fail to read index files w/NoSuchFileException during merges 
> (SOLR-11616 regression)
> 
>
> Key: SOLR-13872
> URL: https://issues.apache.org/jira/browse/SOLR-13872
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: master (9.0), 8.4
>
> Attachments: SOLR-13872.patch, SOLR-13872.patch, SOLR-13872.patch, 
> SOLR-13872.patch, index_churn.pl
>
>
> SOLR-11616 purports to fix a bug in Solr's backup functionality that causes 
> 'NoSuchFileException' errors when attempting to backup an index while it is 
> undergoing indexing (and segment merging)
> Although SOLR-11616 is marked with "Fix Version: 7.2" it's pretty easy to 
> demonstrate that this bug still exists on master, branch_8x, and even in 7.2 
> - so it seems less like the current problem is a "regression" and more that 
> the original fix didn't work.
> 
> The crux of the problem seems to be concurrency bugs in if/how a commit is 
> "reserved" before attempting to copy the files in that commit to the backup 
> location.  
> A possible work around discussed in more depth in the comments below is to 
> update {{solrconfig.xml}} to explicitly configure the {{SolrDeletionPolicy}} 
> with either the {{maxCommitsToKeep}} or {{maxCommitAge}} options to ensure 
> the commits are kept around long enough for the backup to be created.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-11-15 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975259#comment-16975259
 ] 

Chris M. Hostetter commented on SOLR-13872:
---

FYI: the jenkins job {{thetaphi/Lucene-Solr-master-Windows/8235}} recently 
failed in {{org.apache.solr.handler.TestReplicationHandler.testEmptyBackup}} 
because {{empty_backup2}} was not on disk where/when the test expected it -- 
this is due to me (foolishly) using CheckBackupStatus in that method, which i 
later realized isn't suitable for use in a test with more than 
one backup.

This is already fixed in the patch in SOLR-13909




[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-11-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973517#comment-16973517
 ] 

ASF subversion and git services commented on SOLR-13872:


Commit 8c12979fddd4fe822a300d1b04d49f93b5106916 in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8c12979 ]

SOLR-13872: Fixed Backup failures due to race conditions in saving/reserving 
commit points

(cherry picked from commit 30e55e2b6efc55c04761b80c22a106f4a1115722)





[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-11-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973470#comment-16973470
 ] 

ASF subversion and git services commented on SOLR-13872:


Commit 30e55e2b6efc55c04761b80c22a106f4a1115722 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=30e55e2 ]

SOLR-13872: Fixed Backup failures due to race conditions in saving/reserving 
commit points





[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-11-12 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972973#comment-16972973
 ] 

Lucene/Solr QA commented on SOLR-13872:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  8s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green}  1m  5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate ref guide {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 46m 17s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 51m  4s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13872 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985582/SOLR-13872.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  validatesourcepatterns  validaterefguide  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 5df9a51cbfb |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
|  Test Results | https://builds.apache.org/job/PreCommit-SOLR-Build/595/testReport/ |
| modules | C: solr/core solr/solr-ref-guide U: solr |
| Console output | https://builds.apache.org/job/PreCommit-SOLR-Build/595/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (SOLR-13872) Backup can fail to read index files w/NoSuchFileException during merges (SOLR-11616 regression)

2019-10-29 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962282#comment-16962282
 ] 

Chris M. Hostetter commented on SOLR-13872:
---


My first inclination is that we'll need to:
* replace {{getLatestCommit()}} with something that (atomically) does a 
{{reserveAndGetLatestCommit()}} type operation
* make {{saveCommitPoint(gen)}} smart enough to error if the specified 
generation is already deleted, or if it's less than the latest commit and 
"unknown" in the current set of commits (ie: last time 
{{updateCommitPoints(..)}} was called)
* add a lot of synchronization between the methods that "reserve" a commit and 
the {{IndexDeletionPolicy}} abstraction methods that can result in a commit 
being deleted -- we need to make sure that if someone is reserving a commit 
there is no thread-safety/race condition if the IndexWriter is concurrently 
asking (us to ask) the delegate {{IndexDeletionPolicy}} to delete that commit.
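
A minimal sketch of what the first two bullets could look like, assuming a single lock guards both reservations and deletion decisions. The class {{CommitReservations}} and its {{onCommit}} method are hypothetical and are not Solr's {{IndexDeletionPolicyWrapper}}; generations are plain longs, and the deletion rule (keep only the newest commit plus anything reserved) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: class and method names are illustrative, not Solr's API.
class CommitReservations {
  private final TreeSet<Long> known = new TreeSet<>();         // generations not yet deleted
  private final Map<Long, Integer> reserved = new HashMap<>(); // generation -> reservation count

  // Atomically pick the newest known commit and reserve it in one step.
  synchronized long reserveAndGetLatestCommit() {
    if (known.isEmpty()) throw new IllegalStateException("no commits yet");
    long gen = known.last();
    reserved.merge(gen, 1, Integer::sum);
    return gen;
  }

  // Reserve a specific generation; fail loudly if it is deleted or was never seen.
  synchronized void saveCommitPoint(long gen) {
    if (!known.contains(gen)) {
      throw new IllegalStateException("generation " + gen + " is deleted or unknown");
    }
    reserved.merge(gen, 1, Integer::sum);
  }

  // Drop one reservation; the entry disappears when the count reaches zero.
  synchronized void releaseCommitPoint(long gen) {
    reserved.computeIfPresent(gen, (g, n) -> n > 1 ? n - 1 : null);
  }

  // Record a new commit and return the generations that are now safe to delete:
  // everything except the newest commit and anything still reserved.
  synchronized List<Long> onCommit(long gen) {
    known.add(gen);
    long latest = known.last();
    List<Long> deletable = new ArrayList<>();
    for (Iterator<Long> it = known.iterator(); it.hasNext();) {
      long g = it.next();
      if (g != latest && !reserved.containsKey(g)) {
        deletable.add(g);
        it.remove();
      }
    }
    return deletable;
  }
}
```

Because every method holds the same monitor, a reservation can never interleave with the decision to delete that same generation. Whether a single coarse lock like this is acceptable inside the real wrapper is a separate question.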

...but while all of those things are almost certainly necessary, i'm not sure 
they are sufficient -- in particular we need to be careful about the order of 
operations: currently IDPW doesn't invoke its {{updateCommitPoints(...)}} 
method (to modify its internal state) until *after* it's delegated the 
onInit/onCommit calls from the IndexWriter... which means depending on 
where/how we synchronize, and where/how exactly we "mark" reservations, we 
might still wind up in a situation where we tell a caller we've reserved a 
commit, right after it's deleted, but before we *KNOW* it's deleted.
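
The ordering hazard described above can be replayed deterministically in a few lines. This is a hypothetical, single-threaded model, not Solr code: {{StaleViewDemo}}, {{delegateOnCommit}} and {{reserveLatestKnown}} are invented names, and the delegate is assumed to keep only the newest commit.

```java
import java.util.TreeSet;

// Hypothetical, single-threaded replay of the interleaving; names are illustrative only.
class StaleViewDemo {
  final TreeSet<Long> onDisk = new TreeSet<>(); // commits the delegate policy has kept
  final TreeSet<Long> known = new TreeSet<>();  // wrapper's internal view, refreshed late

  // Step 1 of onCommit: the delegate deletes everything but the newest commit.
  void delegateOnCommit(long newGen) {
    onDisk.clear();
    onDisk.add(newGen);
  }

  // Step 2 of onCommit: only now does the wrapper refresh its own view.
  void updateCommitPoints() {
    known.clear();
    known.addAll(onDisk);
  }

  // A reservation that trusts the wrapper's (possibly stale) view.
  long reserveLatestKnown() {
    return known.last();
  }

  // Returns whether the generation handed to the reserver still exists on disk.
  static boolean replay() {
    StaleViewDemo d = new StaleViewDemo();
    d.delegateOnCommit(1);
    d.updateCommitPoints();            // view and disk agree: {1}
    d.delegateOnCommit(2);             // generation 1 is deleted here...
    long gen = d.reserveLatestKnown(); // ...but a reserver running now still sees 1
    d.updateCommitPoints();            // the view catches up too late
    return d.onDisk.contains(gen);     // false: the "reserved" commit is already gone
  }
}
```

In this model the reservation lands in the window between step 1 and step 2, so locking the reservation methods alone does not help unless the same lock also spans the delegate call and the state update.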

Adding to my headaches is confusion about the way "named snapshots" and the way 
{{IndexDeletionPolicyWrapper}} depends on (and consults) 
{{SolrSnapshotMetaDataManager}} to know if some commits should be saved (even 
if not reserved) and if/how the thread safety of "naming" these snapshots 
works (or should work).

It's weird to me that the relationship isn't reversed ... that 
{{SolrSnapshotMetaDataManager}} should call {{IDPW.saveCommitPoint(gen)}} / 
{{IDPW.releaseCommitPoint(gen)}} instead of each {{IndexCommitWrapper}} asking 
the {{SolrSnapshotMetaDataManager}} if it can be deleted ... i keep second 
guessing why it works that way and what i might be misunderstanding and what 
thread safety issues i might not be thinking about as a result. ... i need to 
spend more time wrapping my head around these "named snapshots" and their 
lifecycles and all of the code paths that touch that.



Also, a quick followup to the previously suggested workaround...

In reading up on the code, i realized that even though all of the underlying 
code paths using IndexDeletionPolicyWrapper have these thread safety issues, 
the SolrDeletionPolicy configured workaround should work "better" when using 
the SolrCloud collection API "BACKUP" action, or the (undocumented) CoreAdmin 
"BACKUP" action instead of using the {{/replication}} handler as i was in my 
testing for a single core.

The Core/Collection BACKUP actions *attempt* to reserve the current commit 
before the start of the file copying – meaning that instead of needing to 
configure large enough values of {{maxCommitsToKeep}} or {{maxCommitAge}} to 
ensure that the commit is reserved for the entire duration of the backup 
process (and all of the underlying disk IO), you can (in theory) use smaller 
values because you only need to configure them to reserve a commit long enough 
for the SnapShooter code to have a chance to start and do its own reservation.
