Re: shelf-tests failure on macOS
Julian Foad wrote on 2018-06-15: > Bert Huijben wrote on 2018-06-14: > > On the testsuite we explicitly disable the sleeps via the environment > > variable designed for this (except for a few specific tests). We > > compensate for this problem in quite a few places by making sure that we > > also change the number of characters in the file. > > Thank you Bert! That was one of the remaining pieces of the puzzle. > After adding 'sleep for timestamps' in the unshelve code in r1833532, I > have again changed the tests to use files that are all different lengths > in r1833565, so the tests can work with the sleep disabled. > > Then I fixed one more timestamp-related issue in r1833577: we can't > depend on timestamp-ordered output of 'svn shelves'. One more follow-up: http://svn.apache.org/r1834125 : Shelving: re-enable a test, following sleep-for-timestamps fix. - Julian
Re: shelf-tests failure on macOS
Bert Huijben wrote on 2018-06-14: > On the testsuite we explicitly disable the sleeps via the environment > variable designed for this (except for a few specific tests). We > compensate for this problem in quite a few places by making sure that we > also change the number of characters in the file. Thank you Bert! That was one of the remaining pieces of the puzzle. After adding 'sleep for timestamps' in the unshelve code in r1833532, I have again changed the tests to use files that are all different lengths in r1833565, so the tests can work with the sleep disabled. Then I fixed one more timestamp-related issue in r1833577: we can't depend on timestamp-ordered output of 'svn shelves'. - Julian > > -Original Message- > > From: Julian Foad [mailto:julianf...@apache.org] > > Sent: donderdag 14 juni 2018 19:35 > > To: Philip Martin > > Cc: dev@subversion.apache.org > > Subject: Re: shelf-tests failure on macOS > > > > Philip Martin wrote on 2018-06-14: > > > Philip Martin writes: > > > > Julian Foad writes: > > > >> The most unusual thing about these failing tests is the data they > > > >> write. Strings of 5 or 6 low-value bytes, like: > > > >> > > > >> '\0\1\2\3\4\5' > > > >> '\5\4\3\2\1\0' > > > > > > > > Those are the same size. Could it be a filesystem timestamp resolution > > > > issue? The file changes but the timestamp does not and some change is > > > > not detected? > > > > > > An HFS+ filesystem has a coarse 1s timestamp resolution so that would > > > explain why the problem shows up there but not systems with subsecond > > > resolution. > > > > Sounds like the correct explanation. > > > > > Neither shelve or unshelve appear to use svn_sleep_for_timestamp(). > > > > Actually shelve already does, in the relevant code path, which is inside > > svn_client_shelf_unapply() where it uses svn_client_revert4() which does. > > > > Unshelve sleeps now, since r1833532. Hopefully that's the end of it. > > > > Thank you very much for the insight, Philip! > > > > - Julian > -- - Julian
RE: shelf-tests failure on macOS
On the testsuite we explicitly disable the sleeps via the environment variable designed for this (except for a few specific tests). We compensate for this problem in quite a few places by making sure that we also change the number of characters in the file. Bert > -Original Message- > From: Julian Foad [mailto:julianf...@apache.org] > Sent: donderdag 14 juni 2018 19:35 > To: Philip Martin > Cc: dev@subversion.apache.org > Subject: Re: shelf-tests failure on macOS > > Philip Martin wrote on 2018-06-14: > > Philip Martin writes: > > > Julian Foad writes: > > >> The most unusual thing about these failing tests is the data they > > >> write. Strings of 5 or 6 low-value bytes, like: > > >> > > >> '\0\1\2\3\4\5' > > >> '\5\4\3\2\1\0' > > > > > > Those are the same size. Could it be a filesystem timestamp resolution > > > issue? The file changes but the timestamp does not and some change is > > > not detected? > > > > An HFS+ filesystem has a coarse 1s timestamp resolution so that would > > explain why the problem shows up there but not systems with subsecond > > resolution. > > Sounds like the correct explanation. > > > Neither shelve or unshelve appear to use svn_sleep_for_timestamp(). > > Actually shelve already does, in the relevant code path, which is inside > svn_client_shelf_unapply() where it uses svn_client_revert4() which does. > > Unshelve sleeps now, since r1833532. Hopefully that's the end of it. > > Thank you very much for the insight, Philip! > > - Julian
Re: shelf-tests failure on macOS
Philip Martin wrote on 2018-06-14: > Philip Martin writes: > > Julian Foad writes: > >> The most unusual thing about these failing tests is the data they > >> write. Strings of 5 or 6 low-value bytes, like: > >> > >> '\0\1\2\3\4\5' > >> '\5\4\3\2\1\0' > > > > Those are the same size. Could it be a filesystem timestamp resolution > > issue? The file changes but the timestamp does not and some change is > > not detected? > > An HFS+ filesystem has a coarse 1s timestamp resolution so that would > explain why the problem shows up there but not systems with subsecond > resolution. Sounds like the correct explanation. > Neither shelve or unshelve appear to use svn_sleep_for_timestamp(). Actually shelve already does, in the relevant code path, which is inside svn_client_shelf_unapply() where it uses svn_client_revert4() which does. Unshelve sleeps now, since r1833532. Hopefully that's the end of it. Thank you very much for the insight, Philip! - Julian
Re: shelf-tests failure on macOS
On 14.06.2018 18:09, Philip Martin wrote: > Philip Martin writes: > >> Julian Foad writes: >> >>> The most unusual thing about these failing tests is the data they >>> write. Strings of 5 or 6 low-value bytes, like: >>> >>> '\0\1\2\3\4\5' >>> '\5\4\3\2\1\0' >> Those are the same size. Could it be a filesystem timestamp resolution >> issue? The file changes but the timestamp does not and some change is >> not detected? > An HFS+ filesystem has a coarse 1s timestamp resolution so that would > explain why the problem shows up there but not systems with subsecond > resolution. > > Neither shelve or unshelve appear to use svn_sleep_for_timestamp(). Oh, bravo! This could indeed be the problem. -- Brane
Re: shelf-tests failure on macOS
Philip Martin writes: > Julian Foad writes: > >> The most unusual thing about these failing tests is the data they >> write. Strings of 5 or 6 low-value bytes, like: >> >> '\0\1\2\3\4\5' >> '\5\4\3\2\1\0' > > Those are the same size. Could it be a filesystem timestamp resolution > issue? The file changes but the timestamp does not and some change is > not detected? An HFS+ filesystem has a coarse 1s timestamp resolution so that would explain why the problem shows up there but not systems with subsecond resolution. Neither shelve or unshelve appear to use svn_sleep_for_timestamp(). -- Philip
Re: shelf-tests failure on macOS
Julian Foad writes: > The most unusual thing about these failing tests is the data they > write. Strings of 5 or 6 low-value bytes, like: > > '\0\1\2\3\4\5' > '\5\4\3\2\1\0' Those are the same size. Could it be a filesystem timestamp resolution issue? The file changes but the timestamp does not and some change is not detected? -- Philip
Re: shelf-tests failure on macOS
Daniel Shahaf wrote on 2018-06-14: > Julian Foad wrote on Thu, 14 Jun 2018 16:06 +0100: > > Branko Čibej wrote on 2018-06-14: > > > A sync (which forces a write to disk) should not be necessary in cases > > > like this one, as long as we close the file in Python — that should > > > ensure a write to the OS buffers. If this were the problem, Subversion > > > wouldn't work on macOS at all. > > > > > > I still don't understand why this issue manifests only with the shelving > > > tests, > > > > The most unusual thing about these failing tests is the data they write. > > Strings of 5 or 6 low-value bytes, like: > > > > '\0\1\2\3\4\5' > > '\5\4\3\2\1\0' > > '\1\6\3\4\5' > > > > Just in case that gives anybody a clue. > > In the link in the first post, the expected data is '\0\1\2\3\4\5' and > the actual data is '\5\4\3\2\1\0'. It looks like a uint48_t written > with the wrong endianness. No, no -- these are the literal strings that I put in the Python test code, just as written here, in order to test shelving of 'binary' data. For example, one test starts with file content '\0\1\2\3\4\5' and changes it to '\5\4\3\2\1\0' and shelves that change. > Note that test 15 didn't fail in today's build, even though it did fail > for the last four days in a row. There don't seem to have been any > relevant commits yesterday. The exact location of the failures varies from one test run to another. Some tests fail at one point in one test run, then fail at a later or earlier point in another test run, but all failures seem to be of the same general form. - Julian
Re: shelf-tests failure on macOS
Julian Foad wrote on Thu, 14 Jun 2018 16:06 +0100: > Branko Čibej wrote on 2018-06-14: > > A sync (which forces a write to disk) should not be necessary in cases > > like this one, as long as we close the file in Python — that should > > ensure a write to the OS buffers. If this were the problem, Subversion > > wouldn't work on macOS at all. > > > > I still don't understand why this issue manifests only with the shelving > > tests, > > The most unusual thing about these failing tests is the data they write. > Strings of 5 or 6 low-value bytes, like: > > '\0\1\2\3\4\5' > '\5\4\3\2\1\0' > '\1\6\3\4\5' > > Just in case that gives anybody a clue. In the link in the first post, the expected data is '\0\1\2\3\4\5' and the actual data is '\5\4\3\2\1\0'. It looks like a uint48_t written with the wrong endianness. Note that test 15 didn't fail in today's build, even though it did fail for the last four days in a row. There don't seem to have been any relevant commits yesterday.
Re: shelf-tests failure on macOS
Branko Čibej wrote on 2018-06-14: > A sync (which forces a write to disk) should not be necessary in cases > like this one, as long as we close the file in Python — that should > ensure a write to the OS buffers. If this were the problem, Subversion > wouldn't work on macOS at all. > > I still don't understand why this issue manifests only with the shelving > tests, The most unusual thing about these failing tests is the data they write. Strings of 5 or 6 low-value bytes, like: '\0\1\2\3\4\5' '\5\4\3\2\1\0' '\1\6\3\4\5' Just in case that gives anybody a clue. - Julian
Re: shelf-tests failure on macOS
On 14.06.2018 16:50, Julian Foad wrote: > Branko Čibej wrote: >> The object lifetime and implicit close are very well defined indeed in >> CPython, which we use. We don't use "modern Pythons on other virtual >> machines." >> If this were indeed the source of the problem, a number of >> other tests would be failing randomly on that bot, and elsewhere too, >> I'm sure. But ... >> >>> with open(path, mode) as f: >>> f.write(contents) >> ... on the other hand, making this suggested change won't hurt [...] > Done ... and it didn't fix the problem. > >> It's quite likely a timing problem between cache flush and file open. >> For a while I was looking at the shelve code to see if there was a >> missing sync(), but that actually doesn't make sense in the context of >> the test suite, since 'svn' is invoked as an external process which ends >> before the tests continue > It's the other way around: > * Python writes to a file, > * Python calls 'svn' as a subprocess, > * 'svn' opens the file and doesn't see the change > > Could it be that we need to run 'sync' immediately *before* calling 'svn' as > a subprocess? A sync (which forces a write to disk) should not be necessary in cases like this one, as long as we close the file in Python — that should ensure a write to the OS buffers. If this were the problem, Subversion wouldn't work on macOS at all. I still don't understand why this issue manifests only with the shelving tests, even though we've used the open().write() pattern in our tests pretty much since they existed. The bot runs the build and tests on a RAM disk, but I do the same locally and don't see these issues. It could be a bug in the system-provided python on that particular version of macOS ... but again, we should have seen that with other tests, too. -- Brane
Re: shelf-tests failure on macOS
Branko Čibej wrote: > The object lifetime and implicit close are very well defined indeed in > CPython, which we use. We don't use "modern Pythons on other virtual > machines." > If this were indeed the source of the problem, a number of > other tests would be failing randomly on that bot, and elsewhere too, > I'm sure. But ... > > > with open(path, mode) as f: > > f.write(contents) > > ... on the other hand, making this suggested change won't hurt [...] Done ... and it didn't fix the problem. > It's quite likely a timing problem between cache flush and file open. > For a while I was looking at the shelve code to see if there was a > missing sync(), but that actually doesn't make sense in the context of > the test suite, since 'svn' is invoked as an external process which ends > before the tests continue It's the other way around: * Python writes to a file, * Python calls 'svn' as a subprocess, * 'svn' opens the file and doesn't see the change Could it be that we need to run 'sync' immediately *before* calling 'svn' as a subprocess? - Julian
Re: shelf-tests failure on macOS
Branko Čibej wrote on Thu, 14 Jun 2018 04:04 +0200: > ... on the other hand, making this suggested change won't hurt, as long > as our required Python version for tests supports the 'with' statement, > which I believe it does. In fact we already use the 'with' statement in > our test suite. Since 1.9 we require python>=2.7 which supports the 'with' statement.
Re: shelf-tests failure on macOS
On 14.06.2018 04:04, Branko Čibej wrote: > So, long story short ... changing all the open().write() statements to > the 'with' form should at least remove one possible source of bugs. We also use open().read() a lot, but that should not have timing problems since it doesn't modify the file. -- Brane
Re: shelf-tests failure on macOS
On 13.06.2018 16:31, Julian Foad wrote: > Branko Čibej wrote: >> On 12.06.2018 13:00, Julian Foad wrote: >>> Branko Čibej wrote: https://ci.apache.org/builders/svn-x64-macosx-apr1.3-nothread/builds/3416/steps/Test%20ra_local%2Bfsfs/logs/faillog >>> I should have a chance to debug this on a Mac today or tomorrow. >> The good news is that I can't reproduce this test failure locally. There >> are two possible reasons I can think of: [...] > I have been trying to reproduce and debug this with my colleague Konrad and > we have a theory. The failure mode seems consistent with Python writing a > change to disk and then running 'svn', and 'svn' not seeing that change on > disk. > > The test uses this code in svntest.main.file_write(): > > open(path, mode).write(contents) > > When opening and using a file object "in line" in this way, is seems that the > timing of the implicit "close" is not specified, according to sources such as: > > > https://stackoverflow.com/questions/5362901/in-the-inline-open-and-write-file-is-the-close-implicit#5362917 > > """ > The close() here happens when the file object is deallocated from memory, as > part of its deletion logic. Because modern Pythons on other virtual machines > — like Java and .NET — cannot control when an object is deallocated from > memory, it is no longer considered good Python to open() like this without a > close(). The recommendation today is to use a with statement, which > explicitly requests a close() when the block is exited ... > """ The object lifetime and implicit close are very well defined indeed in CPython, which we use. We don't use "modern Pythons on other virtual machines." If this were indeed the source of the problem, a number of other tests would be failing randomly on that bot, and elsewhere too, I'm sure. But ... > It seems that we need to change the code to something like: > > def file_write(path, contents, mode='w'): > [...] > with open(path, mode) as f: > f.write(contents) > > to guarantee the file is closed before the test proceeds further. It may be > that most of our testing in the past has used Python implementations where > the write is never delayed far enough to affect what an invoked 'svn' > subprocess sees, but on the Mac it is a little different. We are using a Mac > with Python 2. (The 'file_write' method has separate code paths for Python 2 > and Python 3 but both use the same open().write() construct.) ... on the other hand, making this suggested change won't hurt, as long as our required Python version for tests supports the 'with' statement, which I believe it does. In fact we already use the 'with' statement in our test suite. > The same construct appears in many other places in the test suite too. > > We are trying to test this, and it is taking a long time because we can only > currently reproduce the failure after a fresh checkout/build/test cycle. > Maybe because the bug is sensitive to the differences in timing of a warm > cache vs. a cold cache. It's quite likely a timing problem between cache flush and file open. For a while I was looking at the shelve code to see if there was a missing sync(), but that actually doesn't make sense in the context of the test suite, since 'svn' is invoked as an external process which ends before the tests continue — only a bug in the OS filesystem cache could cause the observed behaviour, and that sort of bug would be ... let's just say rather extremely visible with other applications, too. So, long story short ... changing all the open().write() statements to the 'with' form should at least remove one possible source of bugs. -- Brane
Re: shelf-tests failure on macOS
Branko Čibej wrote: > On 12.06.2018 13:00, Julian Foad wrote: >> Branko Čibej wrote: >>> https://ci.apache.org/builders/svn-x64-macosx-apr1.3-nothread/builds/3416/steps/Test%20ra_local%2Bfsfs/logs/faillog >>> >> I should have a chance to debug this on a Mac today or tomorrow. > > The good news is that I can't reproduce this test failure locally. There > are two possible reasons I can think of: [...] I have been trying to reproduce and debug this with my colleague Konrad and we have a theory. The failure mode seems consistent with Python writing a change to disk and then running 'svn', and 'svn' not seeing that change on disk. The test uses this code in svntest.main.file_write(): open(path, mode).write(contents) When opening and using a file object "in line" in this way, is seems that the timing of the implicit "close" is not specified, according to sources such as: https://stackoverflow.com/questions/5362901/in-the-inline-open-and-write-file-is-the-close-implicit#5362917 """ The close() here happens when the file object is deallocated from memory, as part of its deletion logic. Because modern Pythons on other virtual machines — like Java and .NET — cannot control when an object is deallocated from memory, it is no longer considered good Python to open() like this without a close(). The recommendation today is to use a with statement, which explicitly requests a close() when the block is exited ... """ It seems that we need to change the code to something like: def file_write(path, contents, mode='w'): [...] with open(path, mode) as f: f.write(contents) to guarantee the file is closed before the test proceeds further. It may be that most of our testing in the past has used Python implementations where the write is never delayed far enough to affect what an invoked 'svn' subprocess sees, but on the Mac it is a little different. We are using a Mac with Python 2. (The 'file_write' method has separate code paths for Python 2 and Python 3 but both use the same open().write() construct.) The same construct appears in many other places in the test suite too. We are trying to test this, and it is taking a long time because we can only currently reproduce the failure after a fresh checkout/build/test cycle. Maybe because the bug is sensitive to the differences in timing of a warm cache vs. a cold cache. - Julian
Re: shelf-tests failure on macOS
On 12.06.2018 13:00, Julian Foad wrote: > Branko Čibej wrote: >> https://ci.apache.org/builders/svn-x64-macosx-apr1.3-nothread/builds/3416/steps/Test%20ra_local%2Bfsfs/logs/faillog >> >> With the file contents now being printed in the test log, we can see how >> the test is failing. What's worrying is that we don't see this failure >> either on Windows or Linux (at least, I haven't seen it in the buildbot >> logs). > I should have a chance to debug this on a Mac today or tomorrow. The good news is that I can't reproduce this test failure locally. There are two possible reasons I can think of: * something's wrong with the version of APR used by the bot; it's 1.5.x compared to 1.6.x that I use; * the filesystem on the bot is broken; but it's a RAMdisk that gets recreated for each test run and it would be extremely strange if it caused only that test to fail consistently. -- Brane
Re: shelf-tests failure on macOS
Branko Čibej wrote: > https://ci.apache.org/builders/svn-x64-macosx-apr1.3-nothread/builds/3416/steps/Test%20ra_local%2Bfsfs/logs/faillog > > With the file contents now being printed in the test log, we can see how > the test is failing. What's worrying is that we don't see this failure > either on Windows or Linux (at least, I haven't seen it in the buildbot > logs). I should have a chance to debug this on a Mac today or tomorrow. - Julian