Re: [Gluster-devel] relative ordering of writes to same file from two different fds
> On Sep 23, 2016, at 8:59 PM, Jeff Darcy wrote: > >>> write-behind: implement causal ordering and other cleanup >> >>> Rules of causal ordering implemented: >> >>> - If request A arrives after the acknowledgement (to the app, >> >>> i.e, STACK_UNWIND) of another request B, then request B is >> >>> said to have 'caused' request A. >> >> >> With the above principle, two write requests (p1 and p2 in example above) >> issued by _two different threads/processes_ there need _not always_ be a >> 'causal' relationship (whether there is a causal relationship is purely >> based on the "chance" that write-behind chose to ack one/both of them and >> their timing of arrival). > > I think this is an issue of terminology. While it's not *certain* that B > (or p1) caused A (or p2), it's *possible*. Contrast with the case where > they overlap, which could not possibly happen if the application were > trying to ensure order. In the distributed-system literature, this is > often referred to as a causal relationship even though it's really just > the possibility of one, because in most cases even the possibility means > that reordering would be unacceptable. > >> So, current write-behind is agnostic to the >> ordering of p1 and p2 (when done by two threads). >> >> However if p1 and p2 are issued by same thread there is _always_ a causal >> relationship (p2 being caused by p1). > > See above. If we feel bound to respect causal relationships, we have to > be pessimistic and assume that wherever such a relationship *could* exist > it *does* exist. However, as I explained in my previous message, I don't > think it's practical to provide such a guarantee across multiple clients, > and if we don't provide it across multiple clients then it's not worth > much to provide it on a single client. Applications that require such > strict ordering shouldn't use write-behind, or should explicitly flush > between writes. 
Otherwise they'll break unexpectedly when parts are > distributed across multiple nodes. Assuming that everything runs on one > node is the same mistake POSIX makes. The assumption was appropriate > for an earlier era, but not now for a decade or more. We can separate this into two questions: 1. Should there be a causal relationship for a local application? 2. Should there be a causal relationship for a distributed application? I think the answer to #2 is 'NO'. This is an issue the distributed application itself should resolve, either by using the distributed locks we provide or by some mechanism of its own (fsync is required in that case). I think the answer to #1 is 'YES', because buffered I/O should not introduce data-consistency problems that unbuffered I/O does not have; it's very common for a local application to assume the underlying filesystem behaves that way. Furthermore, staying compatible with the Linux page cache will always be the better practice, because many local applications already rely on its semantics. Thanks, Ryan ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
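[Editor's note] Ryan's fsync suggestion for the hand-off case can be sketched as follows. This is a minimal, hypothetical Python illustration using POSIX syscalls, not GlusterFS code: the fsync() after the first write makes the ordering explicit instead of relying on write-behind or page-cache semantics.

```python
import os
import tempfile

# Sketch (not GlusterFS code): two writers handing off on the same file
# through different fds. fsync() on the first fd guarantees the first
# write is flushed before the second writer is signalled to proceed.
path = tempfile.mktemp()

# P1 writes through its own fd and flushes before handing off.
fd1 = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.pwrite(fd1, b"P1", 0)
os.fsync(fd1)      # explicit flush: no reordering is possible past this point
os.close(fd1)

# P2 then writes the same offset through a different fd.
fd2 = os.open(path, os.O_WRONLY)
os.pwrite(fd2, b"P2", 0)
os.close(fd2)

with open(path, "rb") as f:
    final = f.read()   # P2's data is the final contents
os.unlink(path)
```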
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
> compatible to linux page cache will always to be a better practice > way, because there is a lot local applications that has already rely > on its semantics. I don't think users even *know* how the page cache behaves. I don't think even its developers do, in the sense of being able to define it in sufficient detail for formal verification. Instead certain cases are intentionally left undefined - the "odd behavior" Ric mentioned - and can change any time the implementation does. What users have are expectations of things that are guaranteed and things that are not, and the wiser ones know to stay away from things in the second set even if they appear to work most of the time. As Raghavendra Talur points out, we already seem to provide normal linearizability across file descriptors on a single client. There doesn't seem to be much reason to change that. However, I still maintain that there's little value to the user in trying to satisfy stricter POSIX guarantees or closer approximations of whatever the page cache is doing that day. That's especially true since we run on multiple operating systems which almost certainly have different behavior in some of the edge cases.
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
On Mon, Sep 26, 2016 at 1:05 PM, Ryan Ding wrote: > > > On Sep 23, 2016, at 8:59 PM, Jeff Darcy wrote: > > > >>> write-behind: implement causal ordering and other cleanup > >> > >>> Rules of causal ordering implemented: > >> > >>> - If request A arrives after the acknowledgement (to the app, > >> > >>> i.e, STACK_UNWIND) of another request B, then request B is > >> > >>> said to have 'caused' request A. > >> > >> > >> With the above principle, two write requests (p1 and p2 in example > above) > >> issued by _two different threads/processes_ there need _not always_ be a > >> 'causal' relationship (whether there is a causal relationship is purely > >> based on the "chance" that write-behind chose to ack one/both of them > and > >> their timing of arrival). > > > > I think this is an issue of terminology. While it's not *certain* that B > > (or p1) caused A (or p2), it's *possible*. Contrast with the case where > > they overlap, which could not possibly happen if the application were > > trying to ensure order. In the distributed-system literature, this is > > often referred to as a causal relationship even though it's really just > > the possibility of one, because in most cases even the possibility means > > that reordering would be unacceptable. > > > >> So, current write-behind is agnostic to the > >> ordering of p1 and p2 (when done by two threads). > >> > >> However if p1 and p2 are issued by same thread there is _always_ a > causal > >> relationship (p2 being caused by p1). > > > > See above. If we feel bound to respect causal relationships, we have to > > be pessimistic and assume that wherever such a relationship *could* exist > > it *does* exist. However, as I explained in my previous message, I don't > > think it's practical to provide such a guarantee across multiple clients, > > and if we don't provide it across multiple clients then it's not worth > > much to provide it on a single client. 
Applications that require such > > strict ordering shouldn't use write-behind, or should explicitly flush > > between writes. Otherwise they'll break unexpectedly when parts are > > distributed across multiple nodes. Assuming that everything runs on one > > node is the same mistake POSIX makes. The assumption was appropriate > > for an earlier era, but not now for a decade or more. > > We can separate this into 2 question: > 1. should it be a causal relationship in local application ? > 2. should it be a causal relationship in a distribute application ? > I think the answer to #2 is ’NO’. this is an issue that distribute > application should resolve. the way to resolve it is either use distribute > lock we provided or use their own way (fsync is required in such condition). > True. > I think the answer to #1 is ‘YES’. because buffer io should not involve > new data consistency problem than no-buffer io. it’s very common that a > local application will assume underlying file system to be. > further more, compatible to linux page cache will always to be a better > practice way, because there is a lot local applications that has already > rely on its semantics. > I agree. If my understanding is correct, this is the same model that write-behind uses today, if we don't consider the proposed patch. Write-behind orders all causal operations on the inode (file object) irrespective of the FD used. This particular patch brings a small modification where it lets the FSYNC and FLUSH FOPs bypass the order as long as they are not on the same FD as the pending WRITE FOP. Thanks, Raghavendra Talur > Thanks, > Ryan
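[Editor's note] The modification Raghavendra describes can be pictured with a toy model. This is hypothetical Python for illustration only, not GlusterFS code, and the names are invented: an FSYNC or FLUSH only has to order behind pending writes issued on the same fd.

```python
# Toy model (not GlusterFS code) of the change described above: a FLUSH or
# FSYNC bypasses the causal order unless a pending WRITE was issued on the
# same fd. Structure and names here are illustrative assumptions.
pending_writes = [
    {"op": "WRITE", "fd": "fd1"},
    {"op": "WRITE", "fd": "fd2"},
]

def fsync_must_wait_for(fd):
    """Return the pending writes an FSYNC on `fd` still has to wait for."""
    return [w for w in pending_writes if w["fd"] == fd]

# An FSYNC on fd2 waits only for fd2's pending write; fd1's is bypassed.
deps = fsync_must_wait_for("fd2")
```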
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
> > write-behind: implement causal ordering and other cleanup > > > Rules of causal ordering implemented: > > > - If request A arrives after the acknowledgement (to the app, > > > i.e, STACK_UNWIND) of another request B, then request B is > > > said to have 'caused' request A. > > > With the above principle, two write requests (p1 and p2 in example above) > issued by _two different threads/processes_ there need _not always_ be a > 'causal' relationship (whether there is a causal relationship is purely > based on the "chance" that write-behind chose to ack one/both of them and > their timing of arrival). I think this is an issue of terminology. While it's not *certain* that B (or p1) caused A (or p2), it's *possible*. Contrast with the case where they overlap, which could not possibly happen if the application were trying to ensure order. In the distributed-system literature, this is often referred to as a causal relationship even though it's really just the possibility of one, because in most cases even the possibility means that reordering would be unacceptable. > So, current write-behind is agnostic to the > ordering of p1 and p2 (when done by two threads). > > However if p1 and p2 are issued by same thread there is _always_ a causal > relationship (p2 being caused by p1). See above. If we feel bound to respect causal relationships, we have to be pessimistic and assume that wherever such a relationship *could* exist it *does* exist. However, as I explained in my previous message, I don't think it's practical to provide such a guarantee across multiple clients, and if we don't provide it across multiple clients then it's not worth much to provide it on a single client. Applications that require such strict ordering shouldn't use write-behind, or should explicitly flush between writes. Otherwise they'll break unexpectedly when parts are distributed across multiple nodes. Assuming that everything runs on one node is the same mistake POSIX makes. 
The assumption was appropriate for an earlier era, but not now for a decade or more.
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
On Wed, Sep 21, 2016 at 10:58 PM, Raghavendra Talur wrote: > > > On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler wrote: > >> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote: >> >>> Hi all, >>> >>> This mail is to figure out the behavior of write to same file from two >>> different fds. As Ryan quotes in one of comments, >>> >>> >>> >>> I think it’s not safe. in this case: >>> 1. P1 write to F1 use FD1 >>> 2. after P1 write finish, P2 write to the same place use FD2 >>> since they are not conflict with each other now, the order the 2 writes >>> send to underlying fs is not determined. so the final data may be P1’s or >>> P2’s. >>> this semantics is not the same with linux buffer io. linux buffer io >>> will make the second write cover the first one, this is to say the final >>> data is P2’s. >>> you can see it from linux NFS (as we are all network filesystem) >>> fs/nfs/file.c:nfs_write_begin(), nfs will flush ‘incompatible’ request >>> first before another write begin. the way 2 request is determine to be >>> ‘incompatible’ is that they are from 2 different open fds. >>> I think write-behind behaviour should keep the same with linux page >>> cache. >>> >>> >>> >> >> I think that how this actually would work is that both would be written >> to the same page in the page cache (if not using buffered IO), so as long >> as they do not happen at the same time, you would get the second P2 copy of >> data each time. >> > > I apologize if my understanding is wrong but IMO this is exactly what we > do in write-behind too. The cache is inode based and ensures that writes > are ordered irrespective of the FD used for the write. 
> > > Here is the commit message which brought the change > > - > write-behind: implement causal ordering and other cleanup > > > Rules of causal ordering implemented: > > > > > > > - If request A arrives after the acknowledgement (to the app, > > i.e, STACK_UNWIND) of another request B, then request B is > > said to have 'caused' request A. > > With the above principle, two write requests (p1 and p2 in example above) issued by _two different threads/processes_ there need _not always_ be a 'causal' relationship (whether there is a causal relationship is purely based on the "chance" that write-behind chose to ack one/both of them and their timing of arrival). So, current write-behind is agnostic to the ordering of p1 and p2 (when done by two threads). However if p1 and p2 are issued by same thread there is _always_ a causal relationship (p2 being caused by p1). > > > - (corollary) Two requests, which at any point of time, are > > unacknowledged simultaneously in the system can never 'cause' > > each other (wb_inode->gen is based on this) > > > > - If request A is caused by request B, AND request A's region > > has an overlap with request B's region, then the fulfillment > > of request A is guaranteed to happen after the fulfillment of B. > > > > - FD of origin is not considered for the determination of causal > > ordering. > > > > - Append operation's region is considered the whole file. > > > > Other cleanup: > > > > - wb_file_t not required any more. > > > > - wb_local_t not required any more. > > > > - O_RDONLY fd's operations now go through the queue to make sure > > writes in the requested region get fulfilled be > > --- > > Thanks, > Raghavendra Talur > > >> >> Same story for using O_DIRECT - that write bypasses the page cache and >> will update the data directly. >> >> What might happen in practice though is that your applications might use >> higher level IO routines and they might buffer data internally. 
If that >> happens, there is no ordering that is predictable. >> >> Regards, >> >> Ric >> >> >> >>> However, my understanding is that filesystems need not maintain the >>> relative order of writes (as it received from vfs/kernel) on two different >>> fds. Also, if we have to maintain the order it might come with increased >>> latency. The increased latency can be because of having "newer" writes to >>> wait on "older" ones. This wait can fill up write-behind buffer and can >>> eventually result in a full write-behind cache and hence not able to >>> "write-back" newer writes. >>> >>> * What does POSIX say about it? >>> * How do other filesystems behave in this scenario? >>> >>> >>> Also, the current write-behind implementation has the concept of >>> "generation numbers". To quote from comment: >>> >>> >>> >>> uint64_t gen;/* Liability generation number. Represents >>> the current 'state' of liability. Every >>> new addition to the liability list bumps >>> the generation number. >>> >>> >>>
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
> I don't understand the Jeff snippet above - if they are > non-overlapping writes to dfferent offsets, this would never happen. The question is not whether it *would* happen, but whether it would be *allowed* to happen, and my point is that POSIX is often a poor guide. Sometimes it's unreasonably strict, sometimes it's very lax. That said, my example was kind of bad because it doesn't actually work unless issues of durability are brought in. Let's say that there's a crash between the writes and the reads. (It's not even clear when POSIX would consider a distributed system to have crashed. Let's just say *everything* dies.) While the strict write requirements apply to the non-durable state before it's flushed, and thus affect what gets flushed when writes overlap, it's entirely permissible for non-overlapping writes to be flushed out of order. This is even quite likely if the writes are on different file descriptors. http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html > If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily > on the conformance document to tell the user what can be expected from > the system. It is explicitly intended that a null implementation is > permitted. That's my absolute favorite part of POSIX, by the way. It amounts to "do whatever you want" in standards language. What this really means is that, when the system comes back up, the results of the second write could be available even though the first was lost. I'm not saying it happens. I'm not saying it's good or useful behavior. I'm just saying the standard permits it. > If they are the same offset and at the same time, then you can have an > undefined results where you might get fragments of A and fragments of > B (where you might be able to see some odd things if the write spans > pages/blocks). This is where POSIX goes the other way and *over*specifies behavior. 
Normal linearizability requires that an action appear to be atomic at *some* point between issuance and completion. However, the POSIX "after a write" wording forces this to be at the exact moment of completion. It's not undefined. If two writes overlap in both space and time, the one that completes last *must* win. Those "odd things" you mention might be considered non-conformance with the standard. Fortunately, Linux is not POSIX. Linus and others have been quite clear on that. As much as I've talked about formal standards here, "what you can get away with" is the real standard. The page-cache behavior that local filesystems rely on is IMO a poor guide, because extending that behavior across physical systems is difficult to do completely and impossible to do without impacting performance. What matters is whether users will accept this kind of reordering. Here's what I think: (1) An expectation of ordering is only valid if the order is completely unambiguous. (2) This can only be the case if there was some coordination between when the first write completes and when the second is issued. (3) The coordinating entities could be on different machines, in which case the potential for reordering is unavoidable (short of us adding write-behind serialization across all clients). (4) If it's unavoidable in the distributed case, there's not much value in trying to make it airtight in the local case. In other words, standards aside, I'm kind of with Raghavendra on this. We shouldn't add this much complexity and possibly degrade performance unless we can provide a *meaningful guarantee* to users, and this area is already such a swamp that any user relying on particular behavior is likely to get themselves in trouble no matter what we do.
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
On 09/21/2016 08:58 PM, Jeff Darcy wrote: However, my understanding is that filesystems need not maintain the relative order of writes (as it received from vfs/kernel) on two different fds. Also, if we have to maintain the order it might come with increased latency. The increased latency can be because of having "newer" writes to wait on "older" ones. This wait can fill up write-behind buffer and can eventually result in a full write-behind cache and hence not able to "write-back" newer writes. IEEE 1003.1, 2013 edition http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html After a write() to a regular file has successfully returned: Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified. Any subsequent successful write() to the same byte position in the file shall overwrite that file data. Note that the reference is to a *file*, not to a file *descriptor*. It's an application of the general POSIX assumption that time is simple, locking is cheap (if it's even necessary), and therefore time-based requirements like linearizability - what this is - are easy to satisfy. I know that's not very realistic nowadays, but it's pretty clear: according to the standard as it's still written, P2's write *is* required to overwrite P1's. Same vs. different fd or process/thread doesn't even come into play. Just for fun, I'll point out that the standard snippet above doesn't say anything about *non overlapping* writes. Does POSIX allow the following? write A write B read B, get new value read A, get *old* value This is a non-linearizable result, which would surely violate some people's (notably POSIX authors') expectations, but good luck finding anything in that standard which actually precludes it. I will reply to both comments here. 
First, I think that all file systems will perform this way since this is really a function of how the page cache works and O_DIRECT. More broadly, this is not a promise or a hard-and-fast rule - the traditional way applications handle concurrent writes is to use either whole-file or byte-range locking when one or more threads/processes are doing IO to the same file concurrently. I don't understand the Jeff snippet above - if they are non-overlapping writes to different offsets, this would never happen. If the writes are to the same offset and happened at different times, it would not happen either. If they are at the same offset and at the same time, then you can have undefined results where you might get fragments of A and fragments of B (where you might be able to see some odd things if the write spans pages/blocks). This last case is where the normal best practice comes in to suggest using locking. Ric
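[Editor's note] The byte-range locking Ric recommends can be sketched like this. A hypothetical Python illustration using `fcntl.lockf`; note these locks are advisory, so every cooperating writer must take them.

```python
import fcntl
import os
import tempfile

# Sketch of the "best practice" above: take an exclusive byte-range lock
# over a region before writing it, so concurrent writers to the same
# region serialize instead of interleaving fragments.
path = tempfile.mktemp()
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
os.pwrite(fd, b"x" * 16, 0)          # initial file contents

# Lock bytes [0, 8) exclusively; a second writer doing the same would block.
fcntl.lockf(fd, fcntl.LOCK_EX, 8, 0, os.SEEK_SET)
os.pwrite(fd, b"AAAAAAAA", 0)        # write the locked region
fcntl.lockf(fd, fcntl.LOCK_UN, 8, 0, os.SEEK_SET)

data = os.pread(fd, 16, 0)           # first 8 bytes replaced, rest untouched
os.close(fd)
os.unlink(path)
```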
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
- Original Message - > From: "Ric Wheeler" > To: "Raghavendra Gowdappa" , "Gluster Devel" > > Cc: "ryan ding" > Sent: Wednesday, September 21, 2016 6:32:29 PM > Subject: Re: [Gluster-devel] relative ordering of writes to same file from > two different fds > > On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote: > > Hi all, > > > > This mail is to figure out the behavior of write to same file from two > > different fds. As Ryan quotes in one of comments, > > > > > > > > I think it’s not safe. in this case: > > 1. P1 write to F1 use FD1 > > 2. after P1 write finish, P2 write to the same place use FD2 > > since they are not conflict with each other now, the order the 2 writes > > send to underlying fs is not determined. so the final data may be P1’s or > > P2’s. > > this semantics is not the same with linux buffer io. linux buffer io will > > make the second write cover the first one, this is to say the final data > > is P2’s. > > you can see it from linux NFS (as we are all network filesystem) > > fs/nfs/file.c:nfs_write_begin(), nfs will flush ‘incompatible’ request > > first before another write begin. the way 2 request is determine to be > > ‘incompatible’ is that they are from 2 different open fds. > > I think write-behind behaviour should keep the same with linux page cache. > > > > > > I think that how this actually would work is that both would be written to > the > same page in the page cache (if not using buffered IO), so as long as they do > not happen at the same time, you would get the second P2 copy of data each > time. > > Same story for using O_DIRECT - that write bypasses the page cache and will > update the data directly. > > What might happen in practice though is that your applications might use > higher > level IO routines and they might buffer data internally. If that happens, > there > is no ordering that is predictable. Thanks Ric. 1. Are filesystems required to maintain that order? 2. 
Even if there is no such requirement, would there be any benefit in filesystems enforcing that order (probably at the cost of increased latency). regards, Raghavendra > > Regards, > > Ric > > > > > However, my understanding is that filesystems need not maintain the > > relative order of writes (as it received from vfs/kernel) on two different > > fds. Also, if we have to maintain the order it might come with increased > > latency. The increased latency can be because of having "newer" writes to > > wait on "older" ones. This wait can fill up write-behind buffer and can > > eventually result in a full write-behind cache and hence not able to > > "write-back" newer writes. > > > > * What does POSIX say about it? > > * How do other filesystems behave in this scenario? > > > > > > Also, the current write-behind implementation has the concept of > > "generation numbers". To quote from comment: > > > > > > > > uint64_t gen;/* Liability generation number. Represents > > the current 'state' of liability. Every > > new addition to the liability list bumps > > the generation number. > > > > > > > > a newly arrived request is only required > > to perform causal checks against the > > entries > > in the liability list which were present > > at the time of its addition. the > > generation > > number at the time of its addition is > > stored > > in the request and used during checks. > > > > > > > > the liability list can grow while the > > request > > waits in the todo list waiting for its > > dependent operations to complete. however > >
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
> However, my understanding is that filesystems need not maintain the relative > order of writes (as it received from vfs/kernel) on two different fds. Also, > if we have to maintain the order it might come with increased latency. The > increased latency can be because of having "newer" writes to wait on "older" > ones. This wait can fill up write-behind buffer and can eventually result in > a full write-behind cache and hence not able to "write-back" newer writes. IEEE 1003.1, 2013 edition http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html > After a write() to a regular file has successfully returned: > > Any successful read() from each byte position in the file that was > modified by that write shall return the data specified by the write() > for that position until > such byte positions are again modified. > > Any subsequent successful write() to the same byte position in the > file shall overwrite that file data. Note that the reference is to a *file*, not to a file *descriptor*. It's an application of the general POSIX assumption that time is simple, locking is cheap (if it's even necessary), and therefore time-based requirements like linearizability - what this is - are easy to satisfy. I know that's not very realistic nowadays, but it's pretty clear: according to the standard as it's still written, P2's write *is* required to overwrite P1's. Same vs. different fd or process/thread doesn't even come into play. Just for fun, I'll point out that the standard snippet above doesn't say anything about *non overlapping* writes. Does POSIX allow the following? write A write B read B, get new value read A, get *old* value This is a non-linearizable result, which would surely violate some people's (notably POSIX authors') expectations, but good luck finding anything in that standard which actually precludes it.
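[Editor's note] The file-wide (not per-fd) guarantee quoted above can be illustrated with a small sketch. Hypothetical Python; on a local filesystem this is exactly the behavior the quoted wording requires.

```python
import os
import tempfile

# Sketch of the quoted POSIX wording: once a write() has returned, a later
# write() to the same byte positions must overwrite it - the guarantee is
# per *file*, so it holds even across two different file descriptors.
path = tempfile.mktemp()
fd1 = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
fd2 = os.open(path, os.O_RDWR)

os.pwrite(fd1, b"old", 0)        # P1's write completes first
os.pwrite(fd2, b"new", 0)        # P2's write is issued after P1's returned

result = os.pread(fd1, 3, 0)     # must observe P2's data, per the standard
os.close(fd1)
os.close(fd2)
os.unlink(path)
```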
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler wrote: > On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote: > >> Hi all, >> >> This mail is to figure out the behavior of write to same file from two >> different fds. As Ryan quotes in one of comments, >> >> >> >> I think it’s not safe. in this case: >> 1. P1 write to F1 use FD1 >> 2. after P1 write finish, P2 write to the same place use FD2 >> since they are not conflict with each other now, the order the 2 writes >> send to underlying fs is not determined. so the final data may be P1’s or >> P2’s. >> this semantics is not the same with linux buffer io. linux buffer io will >> make the second write cover the first one, this is to say the final data is >> P2’s. >> you can see it from linux NFS (as we are all network filesystem) >> fs/nfs/file.c:nfs_write_begin(), nfs will flush ‘incompatible’ request >> first before another write begin. the way 2 request is determine to be >> ‘incompatible’ is that they are from 2 different open fds. >> I think write-behind behaviour should keep the same with linux page cache. >> >> >> > > I think that how this actually would work is that both would be written to > the same page in the page cache (if not using buffered IO), so as long as > they do not happen at the same time, you would get the second P2 copy of > data each time. > I apologize if my understanding is wrong but IMO this is exactly what we do in write-behind too. The cache is inode based and ensures that writes are ordered irrespective of the FD used for the write. 
Here is the commit message which brought the change - write-behind: implement causal ordering and other cleanup Rules of causal ordering implemented: - If request A arrives after the acknowledgement (to the app, i.e, STACK_UNWIND) of another request B, then request B is said to have 'caused' request A. - (corollary) Two requests, which at any point of time, are unacknowledged simultaneously in the system can never 'cause' each other (wb_inode->gen is based on this) - If request A is caused by request B, AND request A's region has an overlap with request B's region, then the fulfillment of request A is guaranteed to happen after the fulfillment of B. - FD of origin is not considered for the determination of causal ordering. - Append operation's region is considered the whole file. Other cleanup: - wb_file_t not required any more. - wb_local_t not required any more. - O_RDONLY fd's operations now go through the queue to make sure writes in the requested region get fulfilled be --- Thanks, Raghavendra Talur > > Same story for using O_DIRECT - that write bypasses the page cache and > will update the data directly. > > What might happen in practice though is that your applications might use > higher level IO routines and they might buffer data internally. If that > happens, there is no ordering that is predictable. > > Regards, > > Ric > > > >> However, my understanding is that filesystems need not maintain the >> relative order of writes (as it received from vfs/kernel) on two different >> fds. Also, if we have to maintain the order it might come with increased >> latency. The increased latency can be because of having "newer" writes to >> wait on "older" ones. This wait can fill up write-behind buffer and can >> eventually result in a full write-behind cache and hence not able to >> "write-back" newer writes. >> >> * What does POSIX say about it? >> * How do other filesystems behave in this scenario? 
>> >> >> Also, the current write-behind implementation has the concept of >> "generation numbers". To quote from comment: >> >> >> >> uint64_t gen;/* Liability generation number. Represents >> the current 'state' of liability. Every >> new addition to the liability list bumps >> the generation number. >> >> >> a newly arrived request >> is only required >> to perform causal checks against the >> entries >> in the liability list which were present >> at the time of its addition. the >> generation >> number at the time of its addition is >> stored >> in the request and used during checks. >> >> >> the liability list can >> grow while the request >> waits in the todo list waiting for its >> dependent operations to complete. however >> it is not of the request's c
Re: [Gluster-devel] relative ordering of writes to same file from two different fds
On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote: Hi all, This mail is to figure out the behavior of write to same file from two different fds. As Ryan quotes in one of comments, I think it’s not safe. in this case: 1. P1 write to F1 use FD1 2. after P1 write finish, P2 write to the same place use FD2 since they are not conflict with each other now, the order the 2 writes send to underlying fs is not determined. so the final data may be P1’s or P2’s. this semantics is not the same with linux buffer io. linux buffer io will make the second write cover the first one, this is to say the final data is P2’s. you can see it from linux NFS (as we are all network filesystem) fs/nfs/file.c:nfs_write_begin(), nfs will flush ‘incompatible’ request first before another write begin. the way 2 request is determine to be ‘incompatible’ is that they are from 2 different open fds. I think write-behind behaviour should keep the same with linux page cache. I think that how this actually would work is that both would be written to the same page in the page cache (if not using buffered IO), so as long as they do not happen at the same time, you would get the second P2 copy of data each time. Same story for using O_DIRECT - that write bypasses the page cache and will update the data directly. What might happen in practice though is that your applications might use higher level IO routines and they might buffer data internally. If that happens, there is no ordering that is predictable. Regards, Ric However, my understanding is that filesystems need not maintain the relative order of writes (as it received from vfs/kernel) on two different fds. Also, if we have to maintain the order it might come with increased latency. The increased latency can be because of having "newer" writes to wait on "older" ones. This wait can fill up write-behind buffer and can eventually result in a full write-behind cache and hence not able to "write-back" newer writes. * What does POSIX say about it? 
* How do other filesystems behave in this scenario? Also, the current write-behind implementation has the concept of "generation numbers". To quote from comment: uint64_t gen;/* Liability generation number. Represents the current 'state' of liability. Every new addition to the liability list bumps the generation number. a newly arrived request is only required to perform causal checks against the entries in the liability list which were present at the time of its addition. the generation number at the time of its addition is stored in the request and used during checks. the liability list can grow while the request waits in the todo list waiting for its dependent operations to complete. however it is not of the request's concern to depend itself on those new entries which arrived after it arrived (i.e, those that have a liability generation higher than itself) */ So, if a single thread is doing writes on two different fds, generation numbers are sufficient to enforce the relative ordering. If writes are from two different threads/processes, I think write-behind is not obligated to maintain their order. Comments? [1] http://review.gluster.org/#/c/15380/ regards, Raghavendra
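[Editor's note] The generation-number scheme in that comment can be pictured with a toy model. Hypothetical Python, not the actual GlusterFS implementation; names and structure are invented for illustration.

```python
# Toy model (not GlusterFS code) of the liability-generation idea quoted
# above: a request performs causal checks only against liabilities that
# were already in the list when it arrived, i.e. those whose generation
# is not higher than the generation recorded at the request's arrival.

class WriteBehind:
    def __init__(self):
        self.gen = 0
        self.liabilities = []   # (gen, offset, length) of unflushed writes

    def add_liability(self, offset, length):
        self.gen += 1           # every new liability bumps the generation
        self.liabilities.append((self.gen, offset, length))

    def must_wait(self, req_gen, offset, length):
        # Depend only on earlier-generation liabilities whose regions
        # overlap the request's region [offset, offset + length).
        return [
            (g, o, l) for (g, o, l) in self.liabilities
            if g <= req_gen and o < offset + length and offset < o + l
        ]

wb = WriteBehind()
wb.add_liability(0, 100)        # gen 1: unacknowledged write over [0, 100)
req_gen = wb.gen                # a new request arrives and records gen 1
wb.add_liability(50, 10)        # gen 2: arrives after the request, ignored
deps = wb.must_wait(req_gen, 40, 20)   # only the gen-1 liability qualifies
```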