Re: [openstack-dev] [swift] what does swift do if the auditor find that all 3 replicas are corrupt?

2013-11-09 Thread Daniel Li
Got it, thanks very much.


On Fri, Nov 8, 2013 at 2:32 AM, Samuel Merritt s...@swiftstack.com wrote:

 On 11/7/13 5:59 AM, Daniel Li wrote:


 Thanks very much for your help, and please see my inline
 comments/questions.

 On Thu, Nov 7, 2013 at 2:30 AM, Samuel Merritt s...@swiftstack.com wrote:

 On 11/6/13 7:12 AM, Daniel Li wrote:

 Hi,
 I have a question about Swift: what does Swift do if the auditor
 finds that all 3 replicas are corrupt?
 Will it notify the owner of the object (by email to the account
 owner)?
 What will happen to a GET request for the corrupted object?
 Will it return a special error saying that all the replicas are
 corrupted?
 Or will it just say that the object does not exist?
 Or will it just return one of the corrupted replicas?
 Or something else?


 If all 3 (or N) replicas are corrupt, then the auditors will
 eventually quarantine all of them, and subsequent GET requests will
 receive 404 responses.

 No notifications are sent, nor is it really feasible to start
 sending them. The auditor is not a single process; there is one
 Swift auditor process running on each node in a cluster. Therefore,
 when an object is quarantined, there's no way for its auditor to
 know if the other copies are okay or not.

 Note that this is highly unlikely to ever happen, at least with the
 default of 3 replicas. When an auditor finds a corrupt object, it
 quarantines it (moves it to a quarantine directory).

 Did you mean that when the auditor found the corruption, it did not
 copy a good replica from another object server to overwrite the
 corrupted one, but just moved it to a quarantine directory?


 That is correct. The object auditors don't perform any network IO, and in
 fact do not use the ring at all. All they do is scan the filesystems and
 quarantine bad objects in an infinite loop.

 (Of course, there are also container and account auditors that do similar
 things, but for container and account databases.)


  Then, since that object is missing, the replication processes will
 recreate the object by copying it from a node with a good copy.

 When do the replication processes recreate the object by copying it
 from a node with a good copy? Does the auditor send a message to the
 replicator so that it makes the copy immediately? And what is a 'good'
 copy? Is the good copy's MD5 checksum verified before copying?


 It'll happen whenever the other replicators, which are running on other
 nodes, get around to it.

 Replication in Swift is push-based, not pull-based; there is no receiver
 here to which a message could be sent.

 Currently, a good copy is one that hasn't been quarantined. Since
 replication uses rsync to push files around the network, there's no
 checking of MD5 at copy time. However, there is work underway to develop a
 replication protocol that avoids rsync entirely and uses the object server
 throughout the entire replication process, and that would give the object
 server a chance to check MD5 checksums on incoming writes.

 Note that this is only important should 2 replicas experience
 near-simultaneous bitrot; in that case, there is a chance that bad-copy A
 will get quarantined and replaced with bad-copy B. Eventually, though, a
 bad copy will get quarantined and replaced with a good copy, and then
 you've got 2 good copies and 1 bad one, which reduces to a
 previously-discussed scenario.


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] [swift] what does swift do if the auditor find that all 3 replicas are corrupt?

2013-11-07 Thread Samuel Merritt

On 11/7/13 5:59 AM, Daniel Li wrote:


Thanks very much for your help, and please see my inline comments/questions.

On Thu, Nov 7, 2013 at 2:30 AM, Samuel Merritt s...@swiftstack.com wrote:

On 11/6/13 7:12 AM, Daniel Li wrote:

Hi,
I have a question about Swift: what does Swift do if the auditor
finds that all 3 replicas are corrupt?
Will it notify the owner of the object (by email to the account owner)?
What will happen to a GET request for the corrupted object?
Will it return a special error saying that all the replicas are
corrupted?
Or will it just say that the object does not exist?
Or will it just return one of the corrupted replicas?
Or something else?


If all 3 (or N) replicas are corrupt, then the auditors will
eventually quarantine all of them, and subsequent GET requests will
receive 404 responses.

No notifications are sent, nor is it really feasible to start
sending them. The auditor is not a single process; there is one
Swift auditor process running on each node in a cluster. Therefore,
when an object is quarantined, there's no way for its auditor to
know if the other copies are okay or not.

Note that this is highly unlikely to ever happen, at least with the
default of 3 replicas. When an auditor finds a corrupt object, it
quarantines it (moves it to a quarantine directory).

Did you mean that when the auditor found the corruption, it did not
copy a good replica from another object server to overwrite the
corrupted one, but just moved it to a quarantine directory?


That is correct. The object auditors don't perform any network IO, and 
in fact do not use the ring at all. All they do is scan the filesystems 
and quarantine bad objects in an infinite loop.


(Of course, there are also container and account auditors that do 
similar things, but for container and account databases.)
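The object auditor described above can be pictured as a loop over local files. The following is a minimal Python sketch of one pass, not Swift's actual auditor: the flat directory layout, the `expected_md5` mapping and the name `audit_pass` are hypothetical (real Swift records each object's expected checksum in the object's on-disk metadata). It reads only local files and never contacts another node, so it has no way of knowing whether the other replicas are healthy.

```python
import hashlib
import os
import shutil


def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_pass(objects_dir: str, quarantine_dir: str,
               expected_md5: dict[str, str]) -> list[str]:
    """One audit pass over local object files (hypothetical layout).

    Any file whose MD5 does not match its recorded checksum (or that has
    no recorded checksum at all) is moved into quarantine_dir. There is
    no network I/O and no ring lookup here, so nothing is repaired.
    Returns the names of the files that were quarantined.
    """
    os.makedirs(quarantine_dir, exist_ok=True)
    quarantined = []
    for name in sorted(os.listdir(objects_dir)):
        path = os.path.join(objects_dir, name)
        if not os.path.isfile(path):
            continue
        if file_md5(path) != expected_md5.get(name):
            # "Quarantine" = move the file aside; it simply disappears
            # from the set of objects this node serves.
            shutil.move(path, os.path.join(quarantine_dir, name))
            quarantined.append(name)
    return quarantined
```

A long-running auditor would call something like `audit_pass` over and over, typically with rate limiting so that it does not saturate the disks.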



Then, since that object is missing, the replication processes will
recreate the object by copying it from a node with a good copy.

When do the replication processes recreate the object by copying it
from a node with a good copy? Does the auditor send a message to the
replicator so that it makes the copy immediately? And what is a 'good'
copy? Is the good copy's MD5 checksum verified before copying?


It'll happen whenever the other replicators, which are running on other 
nodes, get around to it.


Replication in Swift is push-based, not pull-based; there is no receiver 
here to which a message could be sent.
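A toy model of the push-based design, assuming a hypothetical in-memory `cluster` that maps each node name to its local objects (`replicate_from` is an illustrative name, not a Swift API). Each node's replicator walks its own objects and copies any that a peer lacks; the receiving node plays no active part, which is why there is nothing for an auditor to send a message to. The data is copied as-is with no checksum check, mirroring the rsync behavior described here.

```python
def replicate_from(node: str, cluster: dict[str, dict[str, bytes]],
                   peers: dict[str, list[str]]) -> list[tuple[str, str]]:
    """One replication pass run *by* `node` (push model, hypothetical).

    For each object held locally, copy it to any peer that lacks it.
    Peers never request data; they only receive what is pushed to them.
    Returns the (object, destination) pairs that were pushed.
    """
    pushed = []
    for name, data in cluster[node].items():
        for peer in peers[node]:
            if name not in cluster[peer]:
                cluster[peer][name] = data  # stands in for an rsync push
                pushed.append((name, peer))
    return pushed
```

If node `b` has just quarantined its copy, running `replicate_from("b", ...)` does nothing for that object; the replica reappears only when another node's replicator runs, which matches the "whenever the other replicators get around to it" timing.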


Currently, a good copy is one that hasn't been quarantined. Since 
replication uses rsync to push files around the network, there's no 
checking of MD5 at copy time. However, there is work underway to develop 
a replication protocol that avoids rsync entirely and uses the object 
server throughout the entire replication process, and that would give 
the object server a chance to check MD5 checksums on incoming writes.
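The design of that in-progress protocol isn't described in this thread, so the sketch below only illustrates the general idea of verifying MD5 on an incoming replicated write. `accept_replicated_object` and `ChecksumMismatch` are hypothetical names, and the in-memory `store` stands in for an object server's disk.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when incoming object data does not match its ETag."""


def accept_replicated_object(store: dict[str, bytes], name: str,
                             data: bytes, expected_etag: str) -> None:
    """Hypothetical receiving side of object-server-based replication.

    Unlike a plain rsync push, the receiver recomputes the MD5 of the
    incoming bytes and refuses to store them if they don't match the
    object's ETag, so a bit-rotted copy cannot be used to refill a
    missing replica.
    """
    actual = hashlib.md5(data).hexdigest()
    if actual != expected_etag:
        raise ChecksumMismatch(f"{name}: got {actual}, expected {expected_etag}")
    store[name] = data
```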


Note that this is only important should 2 replicas experience 
near-simultaneous bitrot; in that case, there is a chance that bad-copy 
A will get quarantined and replaced with bad-copy B. Eventually, though, 
a bad copy will get quarantined and replaced with a good copy, and then 
you've got 2 good copies and 1 bad one, which reduces to a 
previously-discussed scenario.
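The near-simultaneous bitrot case can be explored with a small simulation. Everything here is a hypothetical model (the three states, the one-audit-and-one-refill-per-round schedule, and the name `simulate_repair`), but it captures the two facts stated above: the auditor only quarantines bad copies, and the replicator refills a missing copy from any non-quarantined one, which may be an unaudited bad copy. Because a good copy is never removed, the number of bad copies can only stay the same or fall, so as long as one good copy survives, the object eventually ends up with all good replicas.

```python
import random


def simulate_repair(replicas: list[str], seed: int = 0,
                    max_rounds: int = 1000) -> tuple[list[str], int]:
    """Toy model of auditors and replicators repairing one object.

    Each replica is "good", "bad" (bit-rotted, not yet audited) or
    "missing" (quarantined). In each round one random replica is
    audited (a bad one is quarantined), then one missing replica is
    refilled from a random non-missing replica, with no MD5 check.
    Returns the final states and the number of rounds used.
    """
    rng = random.Random(seed)
    state = list(replicas)
    for rounds in range(1, max_rounds + 1):
        i = rng.randrange(len(state))
        if state[i] == "bad":
            state[i] = "missing"  # auditor quarantines it
        sources = [s for s in state if s != "missing"]
        for j, s in enumerate(state):
            if s == "missing" and sources:
                state[j] = rng.choice(sources)  # may copy a bad replica
                break
        if all(s == "good" for s in state):
            return state, rounds
    return state, max_rounds
```

Starting from `["good", "bad", "bad"]`, a bad copy is sometimes replaced by the other bad copy, but the run still converges to three good replicas; starting from three bad copies, no good replica can ever appear.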




Re: [openstack-dev] [swift] what does swift do if the auditor find that all 3 replicas are corrupt?

2013-11-06 Thread Samuel Merritt

On 11/6/13 7:12 AM, Daniel Li wrote:

Hi,
I have a question about Swift: what does Swift do if the auditor
finds that all 3 replicas are corrupt?
Will it notify the owner of the object (by email to the account owner)?
What will happen to a GET request for the corrupted object?
Will it return a special error saying that all the replicas are corrupted?
Or will it just say that the object does not exist?
Or will it just return one of the corrupted replicas?
Or something else?


If all 3 (or N) replicas are corrupt, then the auditors will eventually 
quarantine all of them, and subsequent GET requests will receive 404 
responses.
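From the client's point of view, a quarantined replica simply looks absent. A minimal Python sketch of that, assuming each node is modeled as a dict of its local objects (`proxy_get` is a hypothetical name, and a real proxy also deals with errors, timeouts and handoff nodes, which this ignores): if no node has the object, the result is a plain 404, not a special "all replicas corrupt" status.

```python
def proxy_get(name: str, nodes: list[dict[str, bytes]]) -> tuple[int, bytes]:
    """Hypothetical proxy GET: ask each replica's node in turn.

    A quarantined replica is no longer in its node's object set, so
    that node behaves as if the object never existed. If every node
    lacks the object, the client just gets 404.
    """
    for node in nodes:
        if name in node:
            return 200, node[name]
    return 404, b""
```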


No notifications are sent, nor is it really feasible to start sending 
them. The auditor is not a single process; there is one Swift auditor 
process running on each node in a cluster. Therefore, when an object is 
quarantined, there's no way for its auditor to know if the other copies 
are okay or not.


Note that this is highly unlikely to ever happen, at least with the 
default of 3 replicas. When an auditor finds a corrupt object, it 
quarantines it (moves it to a quarantine directory). Then, since that 
object is missing, the replication processes will recreate the object by 
copying it from a node with a good copy. You'd need to have all replicas 
become corrupt within a very short timespan so that the replicators 
don't get a chance to replace the damaged ones.
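A rough way to see why this is so unlikely, under an assumption the thread does not make: if each replica independently has probability p of becoming corrupt within the window before the replicators repair a damaged copy, then all N replicas are corrupt at the same time with probability p^N. The numbers below are purely illustrative.

```latex
P(\text{all } N \text{ replicas corrupt}) = p^{N},
\qquad p = 10^{-4},\; N = 3 \;\Rightarrow\; P = \left(10^{-4}\right)^{3} = 10^{-12}
```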

