Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-29 Thread Gregory Szorc


> On Mar 27, 2017, at 08:59, Yuya Nishihara  wrote:
> 
> On Mon, 27 Mar 2017 08:25:41 -0700, Gregory Szorc wrote:
>>> On Mar 27, 2017, at 08:13, Yuya Nishihara  wrote:
>>>> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
>>>> # HG changeset patch
>>>> # User Gregory Szorc 
>>>> # Date 1490554582 25200
>>>> #  Sun Mar 26 11:56:22 2017 -0700
>>>> # Branch stable
>>>> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
>>>> # Parent  ed5b25874d998ababb181a939dd37a16ea644435
>>>> changegroup: retry read() when partial data returned
> 
> [...]
> 
>>>>  def readexactly(stream, n):
>>>>      '''read n bytes from stream.read and abort if less was available'''
>>>> -    s = stream.read(n)
>>>> -    if len(s) < n:
>>>> +    # Most of the time, stream.read() returns exactly the number of bytes
>>>> +    # requested. Various Python APIs will perform multiple system calls
>>>> +    # to retrieve more bytes if the first call did not return enough. This
>>>> +    # includes cases where the system call is interrupted.
>>>> +    #
>>>> +    # Even with this retry logic in the Python APIs, we still see failure
>>>> +    # to retrieve the requested number of bytes. So, we build in our own
>>>> +    # retry layer here.
>>>> +    left = n
>>>> +    chunk = stream.read(n)
>>>> +    left -= len(chunk)
>>>> +    assert left >= 0
>>>> +
>>>> +    # Common case is it just works.
>>>> +    if not left:
>>>> +        return chunk
>>>> +
>>>> +    chunks = [chunk]
>>>> +    # Retry to get remaining data. In cases where the stream has errored or
>>>> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
>>>> +    # intermittent failures and isn't common. So the overhead is acceptable.
>>>> +    for i in range(3):
>>>> +        chunk = stream.read(left)
>>>> +        chunks.append(chunk)
>>>> +        left -= len(chunk)
>>>> +        assert left >= 0
>>>> +
>>>> +        if not left:
>>>> +            break
>>> 
>>> Do you have some numbers showing that this change actually mitigates the
>>> problem?
>> 
>> No. I could try to get that this week.
> 
> Many thanks.

Upon further investigation, Mozilla's Windows machines install Mercurial 
through the .exe installer. Running a hacked version for testing will require 
me to create a custom installer. That rabbit hole is too deep for this week.

I'll try to change our automation to use pip/wheels, and then look into this 
for the 4.2 cycle.

> 
>>> Supposing the underlying functions have bugs, I don't think we can always
>>> get contiguous chunks with no data loss. If a few bytes are lost, for
>>> example, we could read an arbitrary part of the stream as a length field
>>> and call readexactly() with that invalid length value.
>> 
>> Hmm. My assumption is that any error that would result in data loss 
>> effectively poisons the stream and prevents future reads.
> 
> I was thinking that, e.g., some intermediate layer might not handle EINTR
> appropriately and the EINTR might be caught by some upper layer. If that
> layer had a temporary buffer, its content would be lost.

Hmmm. I'm trying to think which system calls those could be. I may have to 
spend some time with strace and man pages...
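
To illustrate the failure mode Yuya describes, here is a contrived sketch (not
code from Mercurial or the standard library) of an intermediate buffering layer
whose temporary buffer is silently discarded when an exception such as an
unhandled EINTR escapes part way through a read:

class LossyBufferedReader(object):
    # Hypothetical intermediate layer with an EINTR bug.
    def __init__(self, fileobj):
        self._fileobj = fileobj

    def read(self, n):
        buf = []                  # temporary buffer lives only in this call
        remaining = n
        while remaining:
            # If this raises (e.g. IOError with errno set to EINTR) after a
            # few iterations, everything already collected in buf is thrown
            # away and the caller has no way to resynchronize with the stream.
            chunk = self._fileobj.read(remaining)
            if not chunk:
                break
            buf.append(chunk)
            remaining -= len(chunk)
        return b''.join(buf)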


Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-27 Thread Yuya Nishihara
On Mon, 27 Mar 2017 08:25:41 -0700, Gregory Szorc wrote:
> > On Mar 27, 2017, at 08:13, Yuya Nishihara  wrote:
> >> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> >> # HG changeset patch
> >> # User Gregory Szorc 
> >> # Date 1490554582 25200
> >> #  Sun Mar 26 11:56:22 2017 -0700
> >> # Branch stable
> >> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
> >> # Parent  ed5b25874d998ababb181a939dd37a16ea644435
> >> changegroup: retry read() when partial data returned

[...]

> >>  def readexactly(stream, n):
> >>      '''read n bytes from stream.read and abort if less was available'''
> >> -    s = stream.read(n)
> >> -    if len(s) < n:
> >> +    # Most of the time, stream.read() returns exactly the number of bytes
> >> +    # requested. Various Python APIs will perform multiple system calls
> >> +    # to retrieve more bytes if the first call did not return enough. This
> >> +    # includes cases where the system call is interrupted.
> >> +    #
> >> +    # Even with this retry logic in the Python APIs, we still see failure
> >> +    # to retrieve the requested number of bytes. So, we build in our own
> >> +    # retry layer here.
> >> +    left = n
> >> +    chunk = stream.read(n)
> >> +    left -= len(chunk)
> >> +    assert left >= 0
> >> +
> >> +    # Common case is it just works.
> >> +    if not left:
> >> +        return chunk
> >> +
> >> +    chunks = [chunk]
> >> +    # Retry to get remaining data. In cases where the stream has errored or
> >> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> >> +    # intermittent failures and isn't common. So the overhead is acceptable.
> >> +    for i in range(3):
> >> +        chunk = stream.read(left)
> >> +        chunks.append(chunk)
> >> +        left -= len(chunk)
> >> +        assert left >= 0
> >> +
> >> +        if not left:
> >> +            break
> > 
> > Do you have some numbers showing that this change actually mitigates the
> > problem?
> 
> No. I could try to get that this week.

Many thanks.

> > Supposing the underlying functions have bugs, I don't think we can always
> > get contiguous chunks with no data loss. If a few bytes are lost, for
> > example, we could read an arbitrary part of the stream as a length field
> > and call readexactly() with that invalid length value.
> 
> Hmm. My assumption is that any error that would result in data loss 
> effectively poisons the stream and prevents future reads.

I was thinking that, e.g., some intermediate layer might not handle EINTR
appropriately and the EINTR might be caught by some upper layer. If that
layer had a temporary buffer, its content would be lost.


Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-27 Thread Gregory Szorc


> On Mar 27, 2017, at 08:13, Yuya Nishihara  wrote:
> 
>> On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
>> # HG changeset patch
>> # User Gregory Szorc 
>> # Date 1490554582 25200
>> #  Sun Mar 26 11:56:22 2017 -0700
>> # Branch stable
>> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
>> # Parent  ed5b25874d998ababb181a939dd37a16ea644435
>> changegroup: retry read() when partial data returned
>> 
>> We've seen a rash of "stream ended unexpectedly" errors in the
>> wild. This occurs when changegroup.readexactly() fails to retrieve
>> the exact number of requested bytes in a single stream.read()
>> operation.
>> 
>> There are several plausible theories that explain this behavior.
>> Many include underlying APIs not retrying automatically when
>> interrupted by a signal. While Python is supposed to do this,
>> it could be buggy. There could also be a bug in the various
>> stream reading layers between changegroup.readexactly() and the
>> underlying file descriptor.
>> 
>> Unfortunately, getting this error to reproduce has been extremely
>> difficult and no single cause has been identified.
>> 
>> Since networks (and even filesystems) are unreliable, it seems
>> wise to have some form of retry in place to mitigate this failure.
>> This patch adds that retry logic.
>> 
>> There is already basic test coverage that the exception in this
>> function is raised. The tests didn't change with this patch.
>> 
>> The only obvious negative impact to this change I can see is that
>> in cases where the read operation fails, there is some overhead
>> involved with retrying the operation. In the worst case, this is
>> pure overhead: we'd retry an operation and get the same outcome.
>> But in the best case, it avoids an intermittent/random failure.
>> My assumption is that the overhead to retry is small and that
>> the cost to avoiding a random failure is justified.
>> 
>> There are other changes I'd like to make to stream reading code
>> to help flush out this general failure of partial stream reads.
>> See issue4923 for some suggestions with regard to swallowing
>> exceptions and masking the underlying error. However, these
>> changes aren't suitable for the stable branch. This patch feels
>> minimally invasive and if successful would let us narrow our focus
>> for finding the underlying bug.
>> 
>> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
>> --- a/mercurial/changegroup.py
>> +++ b/mercurial/changegroup.py
>> @@ -35,12 +35,42 @@ from . import (
>> 
>>  def readexactly(stream, n):
>>      '''read n bytes from stream.read and abort if less was available'''
>> -    s = stream.read(n)
>> -    if len(s) < n:
>> +    # Most of the time, stream.read() returns exactly the number of bytes
>> +    # requested. Various Python APIs will perform multiple system calls
>> +    # to retrieve more bytes if the first call did not return enough. This
>> +    # includes cases where the system call is interrupted.
>> +    #
>> +    # Even with this retry logic in the Python APIs, we still see failure
>> +    # to retrieve the requested number of bytes. So, we build in our own
>> +    # retry layer here.
>> +    left = n
>> +    chunk = stream.read(n)
>> +    left -= len(chunk)
>> +    assert left >= 0
>> +
>> +    # Common case is it just works.
>> +    if not left:
>> +        return chunk
>> +
>> +    chunks = [chunk]
>> +    # Retry to get remaining data. In cases where the stream has errored or
>> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
>> +    # intermittent failures and isn't common. So the overhead is acceptable.
>> +    for i in range(3):
>> +        chunk = stream.read(left)
>> +        chunks.append(chunk)
>> +        left -= len(chunk)
>> +        assert left >= 0
>> +
>> +        if not left:
>> +            break
> 
> Do you have some numbers showing that this change actually mitigates the
> problem?

No. I could try to get that this week.

> 
> Supposing the underlying functions have bugs, I don't think we can always
> get contiguous chunks with no data loss. If a few bytes are lost, for
> example, we could read an arbitrary part of the stream as a length field
> and call readexactly() with that invalid length value.

Hmm. My assumption is that any error that would result in data loss effectively 
poisons the stream and prevents future reads.
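
For what it's worth, one cheap way to exercise the retry path locally (and to
start gathering the numbers asked about above) is a wrapper stream that
deliberately returns short reads. This is a hypothetical harness, not part of
the patch or the test suite:

import io

class ShortReadStream(object):
    '''Return at most `cap` bytes per read() call.'''
    def __init__(self, data, cap=3):
        self._buf = io.BytesIO(data)
        self._cap = cap

    def read(self, n):
        return self._buf.read(min(n, self._cap))

# With the patched readexactly(), readexactly(ShortReadStream(b'0123456789'), 10)
# returns all 10 bytes even though every individual read() yields at most 3.
# With cap=2 the three fixed retries are not enough (2+2+2+2 < 10) and it still
# aborts, which is the limitation behind the retry-counter suggestion elsewhere
# in this thread.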


Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-27 Thread Yuya Nishihara
On Sun, 26 Mar 2017 11:56:59 -0700, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc 
> # Date 1490554582 25200
> #  Sun Mar 26 11:56:22 2017 -0700
> # Branch stable
> # Node ID 414128ddd876eb33f70b0dd95d110e29a9308c93
> # Parent  ed5b25874d998ababb181a939dd37a16ea644435
> changegroup: retry read() when partial data returned
> 
> We've seen a rash of "stream ended unexpectedly" errors in the
> wild. This occurs when changegroup.readexactly() fails to retrieve
> the exact number of requested bytes in a single stream.read()
> operation.
> 
> There are several plausible theories that explain this behavior.
> Many include underlying APIs not retrying automatically when
> interrupted by a signal. While Python is supposed to do this,
> it could be buggy. There could also be a bug in the various
> stream reading layers between changegroup.readexactly() and the
> underlying file descriptor.
> 
> Unfortunately, getting this error to reproduce has been extremely
> difficult and no single cause has been identified.
> 
> Since networks (and even filesystems) are unreliable, it seems
> wise to have some form of retry in place to mitigate this failure.
> This patch adds that retry logic.
> 
> There is already basic test coverage that the exception in this
> function is raised. The tests didn't change with this patch.
> 
> The only obvious negative impact to this change I can see is that
> in cases where the read operation fails, there is some overhead
> involved with retrying the operation. In the worst case, this is
> pure overhead: we'd retry an operation and get the same outcome.
> But in the best case, it avoids an intermittent/random failure.
> My assumption is that the overhead to retry is small and that
> the cost to avoiding a random failure is justified.
> 
> There are other changes I'd like to make to stream reading code
> to help flush out this general failure of partial stream reads.
> See issue4923 for some suggestions with regard to swallowing
> exceptions and masking the underlying error. However, these
> changes aren't suitable for the stable branch. This patch feels
> minimally invasive and if successful would let us narrow our focus
> for finding the underlying bug.
> 
> diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
> --- a/mercurial/changegroup.py
> +++ b/mercurial/changegroup.py
> @@ -35,12 +35,42 @@ from . import (
>  
>  def readexactly(stream, n):
>      '''read n bytes from stream.read and abort if less was available'''
> -    s = stream.read(n)
> -    if len(s) < n:
> +    # Most of the time, stream.read() returns exactly the number of bytes
> +    # requested. Various Python APIs will perform multiple system calls
> +    # to retrieve more bytes if the first call did not return enough. This
> +    # includes cases where the system call is interrupted.
> +    #
> +    # Even with this retry logic in the Python APIs, we still see failure
> +    # to retrieve the requested number of bytes. So, we build in our own
> +    # retry layer here.
> +    left = n
> +    chunk = stream.read(n)
> +    left -= len(chunk)
> +    assert left >= 0
> +
> +    # Common case is it just works.
> +    if not left:
> +        return chunk
> +
> +    chunks = [chunk]
> +    # Retry to get remaining data. In cases where the stream has errored or
> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> +    # intermittent failures and isn't common. So the overhead is acceptable.
> +    for i in range(3):
> +        chunk = stream.read(left)
> +        chunks.append(chunk)
> +        left -= len(chunk)
> +        assert left >= 0
> +
> +        if not left:
> +            break

Do you have some numbers showing that this change actually mitigates the
problem?

Supposing the underlying functions have bugs, I don't think we can always
get contiguous chunks with no data loss. If a few bytes are lost, for
example, we could read an arbitrary part of the stream as a length field
and call readexactly() with that invalid length value.
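
To make that failure mode concrete: readexactly() feeds a length-prefixed
framing, so once the stream silently drops a few bytes, the next four bytes
parsed as a length are effectively random and readexactly() gets asked for a
nonsense amount of data. A simplified sketch, modeled loosely on
changegroup.getchunk() (details reduced for illustration):

import struct

def getchunk_sketch(stream):
    # 4-byte big-endian length (which counts the length field itself),
    # followed by the payload; readexactly() as in the patch above.
    d = readexactly(stream, 4)
    l = struct.unpack(">l", d)[0]
    if l <= 4:
        return b''
    return readexactly(stream, l - 4)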


Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-26 Thread Gregory Szorc
On Sun, Mar 26, 2017 at 12:17 PM, Jun Wu  wrote:

> The direction looks good. I'd like to see more solid code. Commented below.
>
> Excerpts from Gregory Szorc's message of 2017-03-26 11:56:59 -0700:
> > +    # Retry to get remaining data. In cases where the stream has errored or
> > +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> > +    # intermittent failures and isn't common. So the overhead is acceptable.
> > +    for i in range(3):
>
> If "stream.read" does not behave as expected, assume it's totally broken.
> So we need to reset the retry counter once we got data. Something like:
>
> def readexactly(..., retry=3):
>
>     retryremaining = retry
>     while True:
>         chunk = stream.read(left)
>         if not chunk:
>             retryremaining -= 1
>             if not retryremaining:
>                 break
>         retryremaining = retry
>         left -= len(chunk)
>
>

I agree with your reasoning here. However, I want to be conservative with
the patch since it is for stable. If a reviewer finds your approach
suitable for stable, I'll make the change. Otherwise, we can take the
conservative approach on stable and change to your version on default.

FWIW there are other things I may want to do on default, such as making the
retry count configurable and possibly adding some blackbox logging to help
us track down how this fails in the wild.
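
A rough illustration of what such a default-branch variant could look like;
the retries parameter and the log callback are made-up names for this sketch,
not existing Mercurial API:

def readexactly_configurable(stream, n, retries=3, log=None):
    chunks = []
    left = n
    for attempt in range(retries + 1):
        chunk = stream.read(left)
        chunks.append(chunk)
        left -= len(chunk)
        if not left:
            break
        if log is not None:
            # Hook for blackbox-style logging of how partial reads behave
            # in the wild.
            log("partial read: attempt %d returned %d bytes, %d still missing"
                % (attempt + 1, len(chunk), left))
    if left:
        raise IOError("stream ended unexpectedly (got %d bytes, expected %d)"
                      % (n - left, n))
    return b''.join(chunks)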

> > +        chunk = stream.read(left)
> > +        chunks.append(chunk)
> > +        left -= len(chunk)
> > +        assert left >= 0
> > +
> > +        if not left:
> > +            break
> > +
> > +    if left:
> >          raise error.Abort(_("stream ended unexpectedly"
> >                              " (got %d bytes, expected %d)")
> > -                          % (len(s), n))
> > -    return s
> > +                          % (n - left, n))
> > +
> > +    return b''.join(chunks)
> >  
> >  def getchunk(stream):
> >      """return the next chunk from stream as a string"""
> 


Re: [PATCH STABLE] changegroup: retry read() when partial data returned

2017-03-26 Thread Jun Wu
The direction looks good. I'd like to see more solid code. Commented below.

Excerpts from Gregory Szorc's message of 2017-03-26 11:56:59 -0700:
> +    # Retry to get remaining data. In cases where the stream has errored or
> +    # is at EOF, this will do a bit of redundant work. But it helps prevent
> +    # intermittent failures and isn't common. So the overhead is acceptable.
> +    for i in range(3):

If "stream.read" does not behave as expected, assume it's totally broken.
So we need to reset the retry counter once we got data. Something like:

def readexactly(..., retry=3):

    retryremaining = retry
    while True:
        chunk = stream.read(left)
        if not chunk:
            retryremaining -= 1
            if not retryremaining:
                break
        retryremaining = retry
        left -= len(chunk)


> +        chunk = stream.read(left)
> +        chunks.append(chunk)
> +        left -= len(chunk)
> +        assert left >= 0
> +
> +        if not left:
> +            break
> +
> +    if left:
>          raise error.Abort(_("stream ended unexpectedly"
>                              " (got %d bytes, expected %d)")
> -                          % (len(s), n))
> -    return s
> +                          % (n - left, n))
> +
> +    return b''.join(chunks)
>  
>  def getchunk(stream):
>      """return the next chunk from stream as a string"""
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel