Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Ryan Ding

> On Sep 23, 2016, at 8:59 PM, Jeff Darcy  wrote:
> 
>>> write-behind: implement causal ordering and other cleanup
>>>
>>> Rules of causal ordering implemented:
>>>
>>> - If request A arrives after the acknowledgement (to the app,
>>>   i.e, STACK_UNWIND) of another request B, then request B is
>>>   said to have 'caused' request A.
>> 
>> 
>> With the above principle, two write requests (p1 and p2 in example above)
>> issued by _two different threads/processes_ there need _not always_ be a
>> 'causal' relationship (whether there is a causal relationship is purely
>> based on the "chance" that write-behind chose to ack one/both of them and
>> their timing of arrival).
> 
> I think this is an issue of terminology.  While it's not *certain* that B
> (or p1) caused A (or p2), it's *possible*.  Contrast with the case where
> they overlap, which could not possibly happen if the application were
> trying to ensure order.  In the distributed-system literature, this is
> often referred to as a causal relationship even though it's really just
> the possibility of one, because in most cases even the possibility means
> that reordering would be unacceptable.
> 
>> So, current write-behind is agnostic to the
>> ordering of p1 and p2 (when done by two threads).
>> 
>> However if p1 and p2 are issued by same thread there is _always_ a causal
>> relationship (p2 being caused by p1).
> 
> See above.  If we feel bound to respect causal relationships, we have to
> be pessimistic and assume that wherever such a relationship *could* exist
> it *does* exist.  However, as I explained in my previous message, I don't
> think it's practical to provide such a guarantee across multiple clients,
> and if we don't provide it across multiple clients then it's not worth
> much to provide it on a single client.  Applications that require such
> strict ordering shouldn't use write-behind, or should explicitly flush
> between writes.  Otherwise they'll break unexpectedly when parts are
> distributed across multiple nodes.  Assuming that everything runs on one
> node is the same mistake POSIX makes.  The assumption was appropriate
> for an earlier era, but not now for a decade or more.

We can separate this into 2 questions:
1. Should there be a causal relationship for a local application?
2. Should there be a causal relationship for a distributed application?
I think the answer to #2 is 'NO'. This is an issue the distributed application 
should resolve itself, either by using the distributed locks we provide or by 
its own means (fsync is required in such a case).
I think the answer to #1 is 'YES', because buffered I/O should not introduce 
data-consistency problems that unbuffered I/O does not have. It is very common 
for a local application to assume the underlying file system behaves this way.
Furthermore, staying compatible with the Linux page cache will always be the 
better practical choice, because many local applications already rely on its 
semantics.
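
To make the single-client case concrete, here is a minimal C sketch of the
scenario under discussion (the mount path and file name are only placeholders);
under page-cache semantics the later write is expected to win:

    /* Two fds on the same file, sequential writes to the same offset.
     * The question in this thread is whether write-behind should give
     * the same result the Linux page cache would. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4] = {0};
            int fd1 = open("/mnt/gluster/testfile", O_RDWR | O_CREAT, 0644);
            int fd2 = open("/mnt/gluster/testfile", O_RDWR);

            pwrite(fd1, "P1", 2, 0);  /* first write, acked to the app        */
            pwrite(fd2, "P2", 2, 0);  /* issued only after the first returned */

            pread(fd1, buf, 2, 0);
            printf("%s\n", buf);      /* page-cache semantics: prints "P2"    */

            close(fd1);
            close(fd2);
            return 0;
    }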

Thanks,
Ryan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Jeff Darcy
> staying compatible with the Linux page cache will always be the better
> practical choice, because many local applications already rely on its
> semantics.

I don't think users even *know* how the page cache behaves.  I don't
think even its developers do, in the sense of being able to define it in
sufficient detail for formal verification.  Instead certain cases are
intentionally left undefined - the "odd behavior" Ric mentioned - and
can change any time the implementation does.  What users have are
expectations of things that are guaranteed and things that are not, and
the wiser ones know to stay away from things in the second set even if
they appear to work most of the time.

As Raghavendra Talur points out, we already seem to provide normal
linearizability across file descriptors on a single client.  There
doesn't seem to be much reason to change that.  However, I still
maintain that there's little value to the user in trying to satisfy
stricter POSIX guarantees or closer approximations of whatever the
page cache is doing that day.  That's especially true since we run on
multiple operating systems which almost certainly have different
behavior in some of the edge cases.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-26 Thread Raghavendra Talur
On Mon, Sep 26, 2016 at 1:05 PM, Ryan Ding  wrote:

>
> > On Sep 23, 2016, at 8:59 PM, Jeff Darcy  wrote:
> >
> >>> write-behind: implement causal ordering and other cleanup
> >>>
> >>> Rules of causal ordering implemented:
> >>>
> >>> - If request A arrives after the acknowledgement (to the app,
> >>>   i.e, STACK_UNWIND) of another request B, then request B is
> >>>   said to have 'caused' request A.
> >>
> >>
> >> With the above principle, for two write requests (p1 and p2 in the example
> >> above) issued by _two different threads/processes_ there need _not always_
> >> be a 'causal' relationship (whether there is a causal relationship is
> >> purely based on the "chance" that write-behind chose to ack one/both of
> >> them and their timing of arrival).
> >
> > I think this is an issue of terminology.  While it's not *certain* that B
> > (or p1) caused A (or p2), it's *possible*.  Contrast with the case where
> > they overlap, which could not possibly happen if the application were
> > trying to ensure order.  In the distributed-system literature, this is
> > often referred to as a causal relationship even though it's really just
> > the possibility of one, because in most cases even the possibility means
> > that reordering would be unacceptable.
> >
> >> So, current write-behind is agnostic to the
> >> ordering of p1 and p2 (when done by two threads).
> >>
> >> However, if p1 and p2 are issued by the same thread there is _always_ a
> >> causal relationship (p2 being caused by p1).
> >
> > See above.  If we feel bound to respect causal relationships, we have to
> > be pessimistic and assume that wherever such a relationship *could* exist
> > it *does* exist.  However, as I explained in my previous message, I don't
> > think it's practical to provide such a guarantee across multiple clients,
> > and if we don't provide it across multiple clients then it's not worth
> > much to provide it on a single client.  Applications that require such
> > strict ordering shouldn't use write-behind, or should explicitly flush
> > between writes.  Otherwise they'll break unexpectedly when parts are
> > distributed across multiple nodes.  Assuming that everything runs on one
> > node is the same mistake POSIX makes.  The assumption was appropriate
> > for an earlier era, but not now for a decade or more.
>
> We can separate this into 2 questions:
> 1. Should there be a causal relationship for a local application?
> 2. Should there be a causal relationship for a distributed application?
> I think the answer to #2 is 'NO'. This is an issue the distributed
> application should resolve itself, either by using the distributed locks we
> provide or by its own means (fsync is required in such a case).
>
True.


> I think the answer to #1 is 'YES', because buffered I/O should not introduce
> data-consistency problems that unbuffered I/O does not have. It is very
> common for a local application to assume the underlying file system behaves
> this way. Furthermore, staying compatible with the Linux page cache will
> always be the better practical choice, because many local applications
> already rely on its semantics.
>

I agree. If my understanding is correct, this is the same model that
write-behind uses today, leaving aside the proposed patch. Write-behind orders
all causal operations on the inode (file object) irrespective of the FD used.
The patch brings a small modification: it lets the FSYNC and FLUSH FOPs bypass
that ordering as long as they are not on the same FD as the pending WRITE FOP.
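
For illustration only, here is a rough sketch of that rule; this is simplified
pseudocode, not the actual write-behind xlator code, and all names in it are
invented for the sketch:

    /* Operations are ordered per inode regardless of the fd they arrived on,
     * while (under the proposed patch) FSYNC and FLUSH only wait for pending
     * writes issued on their own fd. */
    struct pending_write {
            void *fd;                       /* fd the cached write arrived on */
            struct pending_write *next;
    };

    static int must_wait_behind(const struct pending_write *pending,
                                const void *request_fd, int is_sync_or_flush)
    {
            if (!is_sync_or_flush)
                    return 1;   /* simplified: order against every pending write on the inode */
            return pending->fd == request_fd;  /* sync/flush waits only for writes from its own fd */
    }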

Thanks,
Raghavendra Talur


> Thanks,
> Ryan
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-23 Thread Jeff Darcy
> > write-behind: implement causal ordering and other cleanup
> >
> > Rules of causal ordering implemented:
> >
> > - If request A arrives after the acknowledgement (to the app,
> >   i.e, STACK_UNWIND) of another request B, then request B is
> >   said to have 'caused' request A.
> 
>
> With the above principle, two write requests (p1 and p2 in example above)
> issued by _two different threads/processes_ there need _not always_ be a
> 'causal' relationship (whether there is a causal relationship is purely
> based on the "chance" that write-behind chose to ack one/both of them and
> their timing of arrival).

I think this is an issue of terminology.  While it's not *certain* that B
(or p1) caused A (or p2), it's *possible*.  Contrast with the case where
they overlap, which could not possibly happen if the application were
trying to ensure order.  In the distributed-system literature, this is
often referred to as a causal relationship even though it's really just
the possibility of one, because in most cases even the possibility means
that reordering would be unacceptable.

> So, current write-behind is agnostic to the
> ordering of p1 and p2 (when done by two threads).
>
> However if p1 and p2 are issued by same thread there is _always_ a causal
> relationship (p2 being caused by p1).

See above.  If we feel bound to respect causal relationships, we have to
be pessimistic and assume that wherever such a relationship *could* exist
it *does* exist.  However, as I explained in my previous message, I don't
think it's practical to provide such a guarantee across multiple clients,
and if we don't provide it across multiple clients then it's not worth
much to provide it on a single client.  Applications that require such
strict ordering shouldn't use write-behind, or should explicitly flush
between writes.  Otherwise they'll break unexpectedly when parts are
distributed across multiple nodes.  Assuming that everything runs on one
node is the same mistake POSIX makes.  The assumption was appropriate
for an earlier era, but not now for a decade or more.
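
For applications that do need strict ordering, the "explicitly flush between
writes" approach could look roughly like this (a hedged sketch; the file name
is arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    void ordered_writes(void)
    {
            int fd = open("shared.log", O_WRONLY | O_APPEND | O_CREAT, 0644);

            write(fd, "W1\n", 3);
            fsync(fd);            /* W1 reaches stable storage before W2 is issued */
            write(fd, "W2\n", 3);
            fsync(fd);

            close(fd);
    }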
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-23 Thread Raghavendra G
On Wed, Sep 21, 2016 at 10:58 PM, Raghavendra Talur 
wrote:

>
>
> On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler  wrote:
>
>> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
>>
>>> Hi all,
>>>
>>> This mail is to figure out the behavior of write to same file from two
>>> different fds. As Ryan quotes in one of comments,
>>>
>>> 
>>>
>>> I think it's not safe. In this case:
>>> 1. P1 writes to F1 using FD1.
>>> 2. After P1's write finishes, P2 writes to the same place using FD2.
>>> Since the two writes do not conflict with each other now, the order in
>>> which they are sent to the underlying fs is not determined, so the final
>>> data may be P1's or P2's.
>>> This semantics is not the same as Linux buffered I/O, which will make the
>>> second write cover the first one; that is to say, the final data is P2's.
>>> You can see this in Linux NFS (as we are all network filesystems):
>>> fs/nfs/file.c:nfs_write_begin() flushes the 'incompatible' request first
>>> before another write begins. The way two requests are determined to be
>>> 'incompatible' is that they come from two different open fds.
>>> I think write-behind behaviour should keep the same semantics as the Linux
>>> page cache.
>>>
>>> 
>>>
>>
>> I think that how this actually would work is that both would be written
>> to the same page in the page cache (if not using buffered IO), so as long
>> as they do not happen at the same time, you would get the second P2 copy of
>> data each time.
>>
>
> I apologize if my understanding is wrong but IMO this is exactly what we
> do in write-behind too. The cache is inode based and ensures that writes
> are ordered irrespective of the FD used for the write.
>
>
> Here is the commit message which brought the change
> 
> -
> write-behind: implement causal ordering and other cleanup
>
>
> Rules of causal ordering implemented:
>
>  - If request A arrives after the acknowledgement (to the app,
>    i.e, STACK_UNWIND) of another request B, then request B is
>    said to have 'caused' request A.
>

With the above principle, for two write requests (p1 and p2 in the example
above) issued by _two different threads/processes_ there need _not always_ be
a 'causal' relationship (whether there is a causal relationship is purely
based on the "chance" that write-behind chose to ack one/both of them and
their timing of arrival). So, current write-behind is agnostic to the
ordering of p1 and p2 (when done by two threads).

However, if p1 and p2 are issued by the same thread there is _always_ a causal
relationship (p2 being caused by p1).


>
>  - (corollary) Two requests, which at any point of time, are
>    unacknowledged simultaneously in the system can never 'cause'
>    each other (wb_inode->gen is based on this)
>
>  - If request A is caused by request B, AND request A's region
>    has an overlap with request B's region, then the fulfillment
>    of request A is guaranteed to happen after the fulfillment of B.
>
>  - FD of origin is not considered for the determination of causal
>    ordering.
>
>  - Append operation's region is considered the whole file.
>
>  Other cleanup:
>
>  - wb_file_t not required any more.
>
>  - wb_local_t not required any more.
>
>  - O_RDONLY fd's operations now go through the queue to make sure
>    writes in the requested region get fulfilled be
> 
> ---
>
> Thanks,
> Raghavendra Talur
>
>
>>
>> Same story for using O_DIRECT - that write bypasses the page cache and
>> will update the data directly.
>>
>> What might happen in practice though is that your applications might use
>> higher level IO routines and they might buffer data internally. If that
>> happens, there is no ordering that is predictable.
>>
>> Regards,
>>
>> Ric
>>
>>
>>
>>> However, my understanding is that filesystems need not maintain the
>>> relative order of writes (as it received from vfs/kernel) on two different
>>> fds. Also, if we have to maintain the order it might come with increased
>>> latency. The increased latency can be because of having "newer" writes to
>>> wait on "older" ones. This wait can fill up write-behind buffer and can
>>> eventually result in a full write-behind cache and hence not able to
>>> "write-back" newer writes.
>>>
>>> * What does POSIX say about it?
>>> * How do other filesystems behave in this scenario?
>>>
>>>
>>> Also, the current write-behind implementation has the concept of
>>> "generation numbers". To quote from comment:
>>>
>>> 
>>>
>>>  uint64_t gen;/* Liability generation number. Represents
>>>  the current 'state' of liability. Every
>>>  new addition to the liability list bumps
>>>  the generation number.
>>>
>>>
>>>   

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-22 Thread Jeff Darcy
> I don't understand the Jeff snippet above - if they are
> non-overlapping writes to different offsets, this would never happen.

The question is not whether it *would* happen, but whether it would be
*allowed* to happen, and my point is that POSIX is often a poor guide.
Sometimes it's unreasonably strict, sometimes it's very lax.

That said, my example was kind of bad because it doesn't actually work
unless issues of durability are brought in.  Let's say that there's a
crash between the writes and the reads.  (It's not even clear when POSIX
would consider a distributed system to have crashed.  Let's just say
*everything* dies.)  While the strict write requirements apply to the
non-durable state before it's flushed, and thus affect what gets
flushed when writes overlap, it's entirely permissible for
non-overlapping writes to be flushed out of order.  This is even quite
likely if the writes are on different file descriptors.

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html

> If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily
> on the conformance document to tell the user what can be expected from
> the system. It is explicitly intended that a null implementation is
> permitted.

That's my absolute favorite part of POSIX, by the way.  It amounts to
"do whatever you want" in standards language.  What this really means is
that, when the system comes back up, the results of the second write
could be available even though the first was lost.  I'm not saying it
happens.  I'm not saying it's good or useful behavior.  I'm just saying
the standard permits it.

> If they are the same offset and at the same time, then you can have an
> undefined results where you might get fragments of A and fragments of
> B (where you might be able to see some odd things if the write spans
> pages/blocks).

This is where POSIX goes the other way and *over*specifies behavior.
Normal linearizability requires that an action appear to be atomic at
*some* point between issuance and completion.  However, the POSIX "after
a write" wording forces this to be at the exact moment of completion.
It's not undefined.  If two writes overlap in both space and time, the
one that completes last *must* win.  Those "odd things" you mention
might be considered non-conformance with the standard.

Fortunately, Linux is not POSIX.  Linus and others have been quite clear
on that.  As much as I've talked about formal standards here, "what you
can get away with" is the real standard.  The page-cache behavior that
local filesystems rely on is IMO a poor guide, because extending that
behavior across physical systems is difficult to do completely and
impossible to do without impacting performance.  What matters is whether
users will accept this kind of reordering.  Here's what I think:

 (1) An expectation of ordering is only valid if the order is completely
 unambiguous.

 (2) This can only be the case if there was some coordination between
 when the first write completes and when the second is issued.

 (3) The coordinating entities could be on different machines, in which
 case the potential for reordering is unavoidable (short of us
 adding write-behind serialization across all clients).

 (4) If it's unavoidable in the distributed case, there's not much value
 in trying to make it airtight in the local case.

In other words, standards aside, I'm kind of with Raghavendra on this.
We shouldn't add this much complexity and possibly degrade performance
unless we can provide a *meaningful guarantee* to users, and this area
is already such a swamp that any user relying on particular behavior is
likely to get themselves in trouble no matter what we do.

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-21 Thread Ric Wheeler

On 09/21/2016 08:58 PM, Jeff Darcy wrote:

However, my understanding is that filesystems need not maintain the relative
order of writes (as it received from vfs/kernel) on two different fds. Also,
if we have to maintain the order it might come with increased latency. The
increased latency can be because of having "newer" writes to wait on "older"
ones. This wait can fill up write-behind buffer and can eventually result in
a full write-behind cache and hence not able to "write-back" newer writes.

IEEE 1003.1, 2013 edition
http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html


After a write() to a regular file has successfully returned:

Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the write()
for that position until such byte positions are again modified.

Any subsequent successful write() to the same byte position in the
file shall overwrite that file data.

Note that the reference is to a *file*, not to a file *descriptor*.
It's an application of the general POSIX assumption that time is
simple, locking is cheap (if it's even necessary), and therefore
time-based requirements like linearizability - what this is - are
easy to satisfy.  I know that's not very realistic nowadays, but
it's pretty clear: according to the standard as it's still written,
P2's write *is* required to overwrite P1's.  Same vs. different fd
or process/thread doesn't even come into play.

Just for fun, I'll point out that the standard snippet above
doesn't say anything about *non overlapping* writes.  Does POSIX
allow the following?

write A
write B
read B, get new value
read A, get *old* value

This is a non-linearizable result, which would surely violate
some people's (notably POSIX authors') expectations, but good
luck finding anything in that standard which actually precludes
it.



I will reply to both comments here.

First, I think that all file systems will perform this way since this is really 
a function of how the page cache works and O_DIRECT.


More broadly, this is not a promise or hard-and-fast thing - the traditional 
approach for applications that do concurrent writes is to make sure that they 
use either whole-file or byte-range locking when one or more threads/processes 
are doing IO to the same file concurrently.


I don't understand the Jeff snippet above - if they are non-overlapping writes 
to different offsets, this would never happen.


If the writes are to the same offset and happened at different times, it would 
not happen either.


If they are the same offset and at the same time, then you can have an undefined 
results where you might get fragments of A and fragments of B (where you might 
be able to see some odd things if the write spans pages/blocks).


This last case is where the normal best practice comes in to suggest using 
locking.

Ric


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-21 Thread Raghavendra Gowdappa


- Original Message -
> From: "Ric Wheeler" <ricwhee...@gmail.com>
> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Gluster Devel" 
> <gluster-devel@gluster.org>
> Cc: "ryan ding" <ryan.d...@open-fs.com>
> Sent: Wednesday, September 21, 2016 6:32:29 PM
> Subject: Re: [Gluster-devel] relative ordering of writes to same file from 
> two different fds
> 
> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
> > Hi all,
> >
> > This mail is to figure out the behavior of write to same file from two
> > different fds. As Ryan quotes in one of comments,
> >
> > 
> >
> > I think it's not safe. In this case:
> > 1. P1 writes to F1 using FD1.
> > 2. After P1's write finishes, P2 writes to the same place using FD2.
> > Since the two writes do not conflict with each other now, the order in
> > which they are sent to the underlying fs is not determined, so the final
> > data may be P1's or P2's.
> > This semantics is not the same as Linux buffered I/O, which will make the
> > second write cover the first one; that is to say, the final data is P2's.
> > You can see this in Linux NFS (as we are all network filesystems):
> > fs/nfs/file.c:nfs_write_begin() flushes the 'incompatible' request first
> > before another write begins. The way two requests are determined to be
> > 'incompatible' is that they come from two different open fds.
> > I think write-behind behaviour should keep the same semantics as the Linux
> > page cache.
> >
> > 
> 
> I think that how this actually would work is that both would be written to
> the
> same page in the page cache (if not using buffered IO), so as long as they do
> not happen at the same time, you would get the second P2 copy of data each
> time.
> 
> Same story for using O_DIRECT - that write bypasses the page cache and will
> update the data directly.
> 
> What might happen in practice though is that your applications might use
> higher
> level IO routines and they might buffer data internally. If that happens,
> there
> is no ordering that is predictable.

Thanks Ric.

1. Are filesystems required to maintain that order?
2. Even if there is no such requirement, would there be any benefit in 
filesystems enforcing that order (probably at the cost of increased latency)?

regards,
Raghavendra

> 
> Regards,
> 
> Ric
> 
> >
> > However, my understanding is that filesystems need not maintain the
> > relative order of writes (as it received from vfs/kernel) on two different
> > fds. Also, if we have to maintain the order it might come with increased
> > latency. The increased latency can be because of having "newer" writes to
> > wait on "older" ones. This wait can fill up write-behind buffer and can
> > eventually result in a full write-behind cache and hence not able to
> > "write-back" newer writes.
> >
> > * What does POSIX say about it?
> > * How do other filesystems behave in this scenario?
> >
> >
> > Also, the current write-behind implementation has the concept of
> > "generation numbers". To quote from comment:
> >
> > 
> >
> >  uint64_t gen;    /* Liability generation number. Represents
> >                      the current 'state' of liability. Every
> >                      new addition to the liability list bumps
> >                      the generation number.
> >
> >                      a newly arrived request is only required
> >                      to perform causal checks against the entries
> >                      in the liability list which were present
> >                      at the time of its addition. the generation
> >                      number at the time of its addition is stored
> >                      in the request and used during checks.
> >
> >                      the liability list can grow while the request
> >                      waits in the todo list waiting for i

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-21 Thread Jeff Darcy
> However, my understanding is that filesystems need not maintain the relative
> order of writes (as it received from vfs/kernel) on two different fds. Also,
> if we have to maintain the order it might come with increased latency. The
> increased latency can be because of having "newer" writes to wait on "older"
> ones. This wait can fill up write-behind buffer and can eventually result in
> a full write-behind cache and hence not able to "write-back" newer writes.

IEEE 1003.1, 2013 edition
http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

> After a write() to a regular file has successfully returned:
> 
> Any successful read() from each byte position in the file that was
> modified by that write shall return the data specified by the write()
> for that position until such byte positions are again modified.
>
> Any subsequent successful write() to the same byte position in the
> file shall overwrite that file data.

Note that the reference is to a *file*, not to a file *descriptor*.
It's an application of the general POSIX assumption that time is
simple, locking is cheap (if it's even necessary), and therefore
time-based requirements like linearizability - what this is - are
easy to satisfy.  I know that's not very realistic nowadays, but
it's pretty clear: according to the standard as it's still written,
P2's write *is* required to overwrite P1's.  Same vs. different fd
or process/thread doesn't even come into play.

Just for fun, I'll point out that the standard snippet above
doesn't say anything about *non overlapping* writes.  Does POSIX
allow the following?

   write A
   write B
   read B, get new value
   read A, get *old* value

This is a non-linearizable result, which would surely violate
some people's (notably POSIX authors') expectations, but good
luck finding anything in that standard which actually precludes
it.
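
For concreteness, that sequence could be written as follows (a sketch only;
the file name and offsets are arbitrary):

    #include <fcntl.h>
    #include <unistd.h>

    void example(void)
    {
            char a[1], b[1];
            int fd = open("data", O_RDWR);

            pwrite(fd, "A", 1, 0);     /* write A at offset 0    */
            pwrite(fd, "B", 1, 4096);  /* write B at offset 4096 */

            pread(fd, b, 1, 4096);     /* observes the new B ...              */
            pread(fd, a, 1, 0);        /* ... yet could this return old A?    */
            /* The quoted standard text only constrains reads of bytes
             * modified by "that write", so it is hard to point at wording
             * that rules this non-linearizable outcome out. */
            close(fd);
    }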

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-21 Thread Raghavendra Talur
On Wed, Sep 21, 2016 at 6:32 PM, Ric Wheeler  wrote:

> On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:
>
>> Hi all,
>>
>> This mail is to figure out the behavior of write to same file from two
>> different fds. As Ryan quotes in one of comments,
>>
>> 
>>
>> I think it's not safe. In this case:
>> 1. P1 writes to F1 using FD1.
>> 2. After P1's write finishes, P2 writes to the same place using FD2.
>> Since the two writes do not conflict with each other now, the order in
>> which they are sent to the underlying fs is not determined, so the final
>> data may be P1's or P2's.
>> This semantics is not the same as Linux buffered I/O, which will make the
>> second write cover the first one; that is to say, the final data is P2's.
>> You can see this in Linux NFS (as we are all network filesystems):
>> fs/nfs/file.c:nfs_write_begin() flushes the 'incompatible' request first
>> before another write begins. The way two requests are determined to be
>> 'incompatible' is that they come from two different open fds.
>> I think write-behind behaviour should keep the same semantics as the Linux
>> page cache.
>>
>> 
>>
>
> I think that how this actually would work is that both would be written to
> the same page in the page cache (if not using buffered IO), so as long as
> they do not happen at the same time, you would get the second P2 copy of
> data each time.
>

I apologize if my understanding is wrong but IMO this is exactly what we do
in write-behind too. The cache is inode based and ensures that writes are
ordered irrespective of the FD used for the write.


Here is the commit message which brought the change
-
write-behind: implement causal ordering and other cleanup


Rules of causal ordering implemented:

 - If request A arrives after the acknowledgement (to the app,
   i.e, STACK_UNWIND) of another request B, then request B is
   said to have 'caused' request A.

 - (corollary) Two requests, which at any point of time, are
   unacknowledged simultaneously in the system can never 'cause'
   each other (wb_inode->gen is based on this)

 - If request A is caused by request B, AND request A's region
   has an overlap with request B's region, then the fulfillment
   of request A is guaranteed to happen after the fulfillment of B.

 - FD of origin is not considered for the determination of causal
   ordering.

 - Append operation's region is considered the whole file.

 Other cleanup:

 - wb_file_t not required any more.

 - wb_local_t not required any more.

 - O_RDONLY fd's operations now go through the queue to make sure
   writes in the requested region get fulfilled be
---

Thanks,
Raghavendra Talur


>
> Same story for using O_DIRECT - that write bypasses the page cache and
> will update the data directly.
>
> What might happen in practice though is that your applications might use
> higher level IO routines and they might buffer data internally. If that
> happens, there is no ordering that is predictable.
>
> Regards,
>
> Ric
>
>
>
>> However, my understanding is that filesystems need not maintain the
>> relative order of writes (as it received from vfs/kernel) on two different
>> fds. Also, if we have to maintain the order it might come with increased
>> latency. The increased latency can be because of having "newer" writes to
>> wait on "older" ones. This wait can fill up write-behind buffer and can
>> eventually result in a full write-behind cache and hence not able to
>> "write-back" newer writes.
>>
>> * What does POSIX say about it?
>> * How do other filesystems behave in this scenario?
>>
>>
>> Also, the current write-behind implementation has the concept of
>> "generation numbers". To quote from comment:
>>
>> 
>>
>>  uint64_t gen;    /* Liability generation number. Represents
>>                      the current 'state' of liability. Every
>>                      new addition to the liability list bumps
>>                      the generation number.
>>
>>                      a newly arrived request is only required
>>                      to perform causal checks against the entries
>>                      in the liability list which were present
>>                      at the time of its addition. the generation
>>                      number at the time of its addition is stored
>>                      in the request and used during checks.
>>
>>                      the liability list can grow while the request
>>                      waits in the todo list waiting for its
>>                      dependent operations to complete. however
>>                      it is 

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-21 Thread Ric Wheeler

On 09/21/2016 08:06 AM, Raghavendra Gowdappa wrote:

Hi all,

This mail is to figure out the behavior of write to same file from two 
different fds. As Ryan quotes in one of comments,



I think it's not safe. In this case:
1. P1 writes to F1 using FD1.
2. After P1's write finishes, P2 writes to the same place using FD2.
Since the two writes do not conflict with each other now, the order in which 
they are sent to the underlying fs is not determined, so the final data may be 
P1's or P2's.
This semantics is not the same as Linux buffered I/O, which will make the 
second write cover the first one; that is to say, the final data is P2's.
You can see this in Linux NFS (as we are all network filesystems): 
fs/nfs/file.c:nfs_write_begin() flushes the 'incompatible' request first 
before another write begins. The way two requests are determined to be 
'incompatible' is that they come from two different open fds.
I think write-behind behaviour should keep the same semantics as the Linux 
page cache.




I think that how this actually would work is that both would be written to the 
same page in the page cache (if not using O_DIRECT), so as long as they do 
not happen at the same time, you would get the second P2 copy of data each time.


Same story for using O_DIRECT - that write bypasses the page cache and will 
update the data directly.


What might happen in practice though is that your applications might use higher 
level IO routines and they might buffer data internally. If that happens, there 
is no ordering that is predictable.
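
A small illustration of that last point, assuming stdio buffering (the file
name is arbitrary):

    /* stdio FILE* streams buffer in user space, so the order in which data
     * actually reaches the kernel depends on when each buffer happens to be
     * flushed, not on the order of the fwrite calls. */
    #include <stdio.h>

    void example(void)
    {
            FILE *f1 = fopen("shared.dat", "r+");
            FILE *f2 = fopen("shared.dat", "r+");

            fwrite("P1", 1, 2, f1);   /* may sit in f1's user-space buffer */
            fwrite("P2", 1, 2, f2);   /* may sit in f2's buffer            */

            /* Without explicit fflush()/fsync(), either buffer may be
             * written back first, so the final result is not predictable. */
            fflush(f2);
            fflush(f1);               /* in this order, P1 ends up winning */

            fclose(f1);
            fclose(f2);
    }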


Regards,

Ric



However, my understanding is that filesystems need not maintain the relative order of writes (as it received 
from vfs/kernel) on two different fds. Also, if we have to maintain the order it might come with increased 
latency. The increased latency can be because of having "newer" writes to wait on "older" 
ones. This wait can fill up write-behind buffer and can eventually result in a full write-behind cache and 
hence not able to "write-back" newer writes.

* What does POSIX say about it?
* How do other filesystems behave in this scenario?


Also, the current write-behind implementation has the concept of "generation 
numbers". To quote from comment:



 uint64_t gen;    /* Liability generation number. Represents
                     the current 'state' of liability. Every
                     new addition to the liability list bumps
                     the generation number.

                     a newly arrived request is only required
                     to perform causal checks against the entries
                     in the liability list which were present
                     at the time of its addition. the generation
                     number at the time of its addition is stored
                     in the request and used during checks.

                     the liability list can grow while the request
                     waits in the todo list waiting for its
                     dependent operations to complete. however
                     it is not of the request's concern to depend
                     itself on those new entries which arrived
                     after it arrived (i.e, those that have a
                     liability generation higher than itself)
                  */


So, if a single thread is doing writes on two different fds, generation numbers 
are sufficient to enforce the relative ordering. If writes are from two 
different threads/processes, I think write-behind is not obligated to maintain 
their order. Comments?
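
A hedged sketch of the generation-number idea described in the comment above;
the structures are simplified stand-ins invented for illustration, not the
real write-behind types:

    #include <stdint.h>

    struct liability_entry {
            uint64_t gen;                 /* generation when it was added   */
            struct liability_entry *next;
    };

    struct request {
            uint64_t gen;                 /* generation captured at arrival */
    };

    /* A request only has to check entries that were already on the liability
     * list when it arrived, i.e. entries whose generation is not newer than
     * the generation the request recorded for itself. */
    static int needs_causal_check(const struct request *req,
                                  const struct liability_entry *e)
    {
            return e->gen <= req->gen;
    }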

[1] http://review.gluster.org/#/c/15380/

regards,
Raghavendra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-20 Thread Raghavendra Gowdappa
Hi all,

This mail is to figure out the behavior of write to same file from two 
different fds. As Ryan quotes in one of comments,



I think it's not safe. In this case:
1. P1 writes to F1 using FD1.
2. After P1's write finishes, P2 writes to the same place using FD2.
Since the two writes do not conflict with each other now, the order in which 
they are sent to the underlying fs is not determined, so the final data may be 
P1's or P2's.
This semantics is not the same as Linux buffered I/O, which will make the 
second write cover the first one; that is to say, the final data is P2's.
You can see this in Linux NFS (as we are all network filesystems): 
fs/nfs/file.c:nfs_write_begin() flushes the 'incompatible' request first 
before another write begins. The way two requests are determined to be 
'incompatible' is that they come from two different open fds.
I think write-behind behaviour should keep the same semantics as the Linux 
page cache.
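
As a conceptual illustration of the NFS behaviour Ryan refers to (a sketch of
the idea only, not the actual fs/nfs code; the names are invented):

    /* In nfs_write_begin() the kernel flushes a cached request for the same
     * page if it is "incompatible", and one thing that makes two requests
     * incompatible is that they come from different open contexts, i.e.
     * different open fds. */
    struct cached_req {
            void *open_ctx;   /* which open fd produced the cached dirty data */
    };

    static int incompatible(const struct cached_req *cached,
                            const void *new_open_ctx)
    {
            return cached->open_ctx != new_open_ctx;  /* flush before reusing the page */
    }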



However, my understanding is that filesystems need not maintain the relative 
order of writes (as received from the vfs/kernel) on two different fds. Also, if 
we have to maintain the order it might come with increased latency. The 
increased latency can be because of having "newer" writes wait on "older" 
ones. This wait can fill up the write-behind buffer and can eventually result 
in a full write-behind cache, and hence an inability to "write-back" newer writes.

* What does POSIX say about it?
* How do other filesystems behave in this scenario?


Also, the current write-behind implementation has the concept of "generation 
numbers". To quote from comment:



uint64_t gen;    /* Liability generation number. Represents
                    the current 'state' of liability. Every
                    new addition to the liability list bumps
                    the generation number.

                    a newly arrived request is only required
                    to perform causal checks against the entries
                    in the liability list which were present
                    at the time of its addition. the generation
                    number at the time of its addition is stored
                    in the request and used during checks.

                    the liability list can grow while the request
                    waits in the todo list waiting for its
                    dependent operations to complete. however
                    it is not of the request's concern to depend
                    itself on those new entries which arrived
                    after it arrived (i.e, those that have a
                    liability generation higher than itself)
                 */


So, if a single thread is doing writes on two different fds, generation numbers 
are sufficient to enforce the relative ordering. If