[HACKERS] Beta 6 Regression results on RedHat 7.0.
Ok, thanks to our snowstorm :-0 I have been working on the beta 6 RPM situation on my _slow_ notebook today (power outages for ten minutes at a time happening at hour or so intervals due to 45mph+ winds and a foot of snow).

Well, I have preliminary RPM's built -- just need to work on the contrib tree situation.

I ran regression the usual RPM way (which I am fully aware is not the normally approved method, but it _would_ be the method any RPM beta testers would use), and got a different failure, one that is not locale related (LC_ALL=C both for the initdb and the postmaster startup in the newest initscript). See attached regression.diffs for details of the temptest failure I experienced.

Regression run with CWD=/usr/share/test/regress, user=postgres.
./pg_regress --schedule=parallel_schedule

This is the only regression test failure I have found thus far. I have never seen this failure before, so I'm not sure where to proceed.

Now to attack the contrib tree (looking forward to my new notebook, as this old P133 takes an hour and twenty minutes to slog through a full build).

Seeing that RC1 is in prep, is there a pressing need to upload and release beta 6 RPM's, or will it be a day or two before RC1?
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

*** ./expected/temp.out Sat Jan 8 22:48:39 2000
--- ./results/temp.out Tue Mar 20 16:06:10 2001
***************
*** 23,32 ****
  (1 row)
  
  DROP TABLE temptest;
  SELECT * FROM temptest;
   col
  -----
!    1
  (1 row)
  
  DROP TABLE temptest;
--- 23,34 ----
  (1 row)
  
  DROP TABLE temptest;
+ NOTICE: FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
+ ERROR: heap_drop_with_catalog: FlushRelationBuffers returned -2
  SELECT * FROM temptest;
   col
  -----
!    2
  (1 row)
  
  DROP TABLE temptest;
***************
*** 34,37 ****
  -- test temp table deletion
  \c regression
  SELECT * FROM temptest;
! ERROR: Relation 'temptest' does not exist
--- 36,43 ----
  -- test temp table deletion
  \c regression
  SELECT * FROM temptest;
!  col
! -----
!    1
! (1 row)
! 
---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
Lamar Owen [EMAIL PROTECTED] writes:
> DROP TABLE temptest;
> + NOTICE: FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
> + ERROR: heap_drop_with_catalog: FlushRelationBuffers returned -2
> SELECT * FROM temptest;

Hoo, that's interesting ... Exactly what fileset were you using again?

> Seeing that RC1 is in prep, is there a pressing need to upload and release beta 6 RPM's, or will it be a day or two before RC1?

I think you might as well wait for RC1 as far as actually making RPMs goes. But do you want to let anyone else check out the RPM build process? For instance, I've been wondering what you did about the which-set-of-headers-to-install issue.

regards, tom lane
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Lamar Owen wrote:
> Seeing that RC1 is in prep, is there a pressing need to upload and release beta 6 RPM's, or will it be a day or two before RC1?

I'm going to do RC1 tonight ... so no pressing need :)
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Tom Lane wrote:
> Lamar Owen [EMAIL PROTECTED] writes:
>> DROP TABLE temptest;
>> + NOTICE: FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
>> + ERROR: heap_drop_with_catalog: FlushRelationBuffers returned -2
>> SELECT * FROM temptest;
>
> Hoo, that's interesting ... Exactly what fileset were you using again?

When you say 'fileset', I'm assuming you are referring to the --schedule parameter -- I am invoking the following command:
./pg_regress --schedule=parallel_schedule

7.1beta6 distribution tarball. LC_ALL=C. Compiled on RedHat 7 as shipped.

I'm rerunning to see if it is intermittent. Second run -- no error. Running a third time..no error. Now I'm confused. What would cause such an error, Tom? I'm going to check on my desktop, once power gets more stable (and it quits lightning -- yes, a snowstorm with lightning :-0 I certainly got what I wanted.). So, more to come later.

>> Seeing that RC1 is in prep, is there a pressing need to upload and release beta 6 RPM's, or will it be a day or two before RC1?
>
> I think you might as well wait for RC1 as far as actually making RPMs goes. But do you want to let anyone else check out the RPM build process? For instance, I've been wondering what you did about the which-set-of-headers-to-install issue.

Oh, ok. Spec file attached. All other files needed are the beta6 tarball and the contents of the beta4-1 source rpm, with names changed to match the beta6 version number. There are some other changes I have to merge in -- particularly a set from Karl for the optional PL/Perl build, as well as others, so this is a preliminary spec file. But I was just getting the basic build done and tested.

To directly answer your question, I'm using 'make install-all-headers' and stuffing it into the devel rpm in one piece at this time.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11

Summary: PostgreSQL client programs and libraries.
Name: postgresql
Version: 7.1beta6
Release: 0.2
License: BSD
Group: Applications/Databases
Source0: ftp://ftp.postgresql.org/pub/source/v%{version}/postgresql-%{version}.tar.gz
Source3: postgresql.init-%{version}
Source4: file-lists-pgsql-%{version}.tar.gz
Source5: ftp://ftp.postgresql.org/pub/source/v%{version}/postgresql-%{version}.tar.gz.md5
Source6: README.rpm-dist.postgresql-%{version}
Source7: pg-migration-scripts-%{version}.tar.gz
Source8: logrotate.postgresql-%{version}
Source10: http://www.retep.org.uk/postgres/jdbc7.0-1.1.jar
Source11: http://www.retep.org.uk/postgres/jdbc7.0-1.2.jar
Source12: postgresql-dump.1.gz
Source14: rh-pgdump.sh
Patch1: rpm-pgsql-%{version}.patch
Requires: perl
Prereq: /sbin/chkconfig /sbin/ldconfig /usr/sbin/useradd initscripts
BuildPrereq: python-devel perl tcl /lib/cpp
Url: http://www.postgresql.org/
Obsoletes: postgresql-clients
Buildroot: %{_tmppath}/%{name}-%{version}-root

# This is the PostgreSQL Global Development Group Official RPMset spec file.
# Copyright 2000 Lamar Owen [EMAIL PROTECTED] [EMAIL PROTECTED]
# and others listed.

# Major Contributors:
# ---
# Lamar Owen
# Trond Eivind Glomsrød [EMAIL PROTECTED]
# Thomas Lockhart

# This spec file and ancillary files are licensed in accordance with
# The PostgreSQL license.

# Below are the default build package list macros. These can be overridden by defining
# on the rpm command line:
# rpm --define 'packagename 1' to force the package to build.
# rpm --define 'packagename 0' to force the package NOT to build.
# The base package, the lib package, the devel package, and the server package always get built.

%{!?perl:%define perl 1}
%{!?tcl:%define tcl 1}
%{!?tkpkg:%define tkpkg %{expand:tcl}}
%{!?odbc:%define odbc 1}
%{!?jdbc:%define jdbc 1}
%{!?test:%define test 1}
%{!?python:%define python 1}
%{!?pltcl:%define pltcl 1}
%{!?plperl:%define plperl 1}

# Utility feature defines.
%{!?enable_mb:%define enable_mb 1}
%{!?pgaccess:%define pgaccess 1}

%dump

%description
PostgreSQL is an advanced Object-Relational database management system (DBMS) that supports almost all SQL constructs (including transactions, subselects and user-defined types and functions). The postgresql package includes the client programs and libraries that you'll need to access a PostgreSQL DBMS server. These PostgreSQL client programs are programs that directly manipulate the internal structure of PostgreSQL databases on a PostgreSQL server. These client programs can be located on the same machine with the PostgreSQL server, or may be on a remote machine which accesses a PostgreSQL server over a network connection. This package contains the client libraries for C and C++, as well as command-line utilities for managing PostgreSQL databases on a PostgreSQL server.

If you want to manipulate a PostgreSQL database on a remote PostgreSQL server, you need this package. You also need to install this package if you're installing the postgresql-server package.

%package libs
Summary: The
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
Lamar Owen [EMAIL PROTECTED] writes:
>>> DROP TABLE temptest;
>>> + NOTICE: FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
>>> + ERROR: heap_drop_with_catalog: FlushRelationBuffers returned -2
>>> SELECT * FROM temptest;
>>
>> Hoo, that's interesting ... Exactly what fileset were you using again?
>
> When you say 'fileset', I'm assuming you are referring to the --schedule parameter --

No, I was wondering about whether you had an inconsistent set of source files, or had managed to not do a complete rebuild, or something like that. The above error should be entirely impossible considering that the table in question is a temp table that's not been touched by any other backend. If you did manage to get this from a clean build then I think we have a serious problem to look at.

>> I think you might as well wait for RC1 as far as actually making RPMs goes. But do you want to let anyone else check out the RPM build process? For instance, I've been wondering what you did about the which-set-of-headers-to-install issue.
>
> Oh, ok. Spec file attached. All other files needed are the beta6 tarball and the contents of the beta4-1 source rpm, with names changed to match the beta6 version number.

OK, I will pull the files and try to replicate this on my own laptop.

Does anyone else have time to try to duplicate the problem tonight? If it's replicatable at all, I think it's a release stopper.

> To directly answer your question, I'm using 'make install-all-headers' and stuffing it into the devel rpm in one piece at this time.

Works for me.

regards, tom lane
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Tom Lane wrote:
> Lamar Owen [EMAIL PROTECTED] writes:
>>>> DROP TABLE temptest;
>>>> + NOTICE: FlushRelationBuffers(temptest, 0): block 0 is referenced (private 0, global 1)
>>>> + ERROR: heap_drop_with_catalog: FlushRelationBuffers returned -2
>>>> SELECT * FROM temptest;
>>>
>>> Hoo, that's interesting ... Exactly what fileset were you using again?
>>
>> When you say 'fileset', I'm assuming you are referring to the --schedule parameter --
>
> No, I was wondering about whether you had an inconsistent set of source files, or had managed to not do a complete rebuild, or something like that. The above error should be entirely impossible considering that the table in question is a temp table that's not been touched by any other backend. If you did manage to get this from a clean build then I think we have a serious problem to look at.

Standard RPM rebuild -- always wipes the whole build tree out and re-expands from the tarball, reapplies patches, and rebuilds from scratch every time I change even the smallest detail in the spec file -- which is why it takes so long to get these things out. So, no, this is a scratch build from a fresh tarball.

> Does anyone else have time to try to duplicate the problem tonight? If it's replicatable at all, I think it's a release stopper.

I have not yet been able to repeat the problem. I am running my fifth regression test run (which takes a long time on this P133) with a freshly initdb'ed PGDATA -- the previous regression runs were done on the same PGDATA tree as the first run was done on. Took 12 minutes 40 seconds, but I can't repeat the error.

I'm hoping it was a problem on my machine -- educate me on what caused the error so I can see if something in my setup did something not so nice.

So, the score is one error out of six test runs, thus far.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
Lamar Owen [EMAIL PROTECTED] writes:
> I'm hoping it was a problem on my machine -- educate me on what caused the error

Well, that's exactly what I'd like to know. The direct cause of the error is that DROP TABLE is finding that some other backend has a reference-count hold on a page of the temp table it's trying to drop. Since no other backend should be trying to touch this temp table, there's something pretty fishy here.

Given that this is a parallel test, you may be looking at a low-probability timing-dependent failure. I'd say set up the machine and run repeat tests for an hour or three ... that's what I plan to do here.

BTW, what postmaster parameters are you using --- -B and so forth?

regards, tom lane
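Repeat runs of the kind suggested here are easy to script. The following is only a minimal sketch, not anything posted in the thread: the helper name `repeat_runs`, its parameters, and the assumption that a failing regression run shows up as a non-zero exit status are all illustrative.

```python
import subprocess
import time


def repeat_runs(cmd, max_runs, time_budget_s=None):
    """Run cmd up to max_runs times; return (runs_done, failed_run_numbers).

    Hypothetical helper. It assumes a failed test run is signalled by a
    non-zero exit status, which should be checked for the tool being run.
    """
    deadline = None
    if time_budget_s is not None:
        deadline = time.monotonic() + time_budget_s
    failed = []
    runs = 0
    while runs < max_runs:
        # Stop early once the optional time budget is used up.
        if deadline is not None and time.monotonic() >= deadline:
            break
        runs += 1
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failed.append(runs)  # remember which iteration failed
    return runs, failed
```

Run from the regress directory, a call such as `repeat_runs(["./pg_regress", "--schedule=parallel_schedule"], max_runs=50, time_budget_s=3 * 3600)` would cover the "hour or three" mentioned above. For a timing-dependent failure, the list of failing iterations is more useful than a single pass/fail result.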
RE: [HACKERS] Beta 6 Regression results on RedHat 7.0.
> I'm rerunning to see if it is intermittent. Second run -- no error. Running a third time..no error. Now I'm confused. What would cause such an error, Tom? I'm going to check on my

Hmm, concurrent checkpoint? Probably we could simplify dirty test in BufferSync() - ie test bufHdr->cntxDirty without holding shlock (and pin!) on buffer: should be good as long as we set cntxDirty flag *before* XLogInsert in access methods.

Have to look more...

Vadim
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Tom Lane wrote:
> Since no other backend should be trying to touch this temp table, there's something pretty fishy here.

I see.

> Given that this is a parallel test, you may be looking at a low-probability timing-dependent failure. I'd say set up the machine and run repeat tests for an hour or three ... that's what I plan to do here.

As a broadcast engineer, I'm a little too familiar with such things. But this isn't an engineer list, so I'll spare you the war stories. :-)

> BTW, what postmaster parameters are you using --- -B and so forth?

Default. To be changed before RPM release, but currently it is the default. The only option that postmaster.opts records is -D, and I'm not passing anything else.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Tom Lane wrote:
> Lamar Owen [EMAIL PROTECTED] writes:
>> I'm hoping it was a problem on my machine -- educate me on what caused the error
>
> Well, that's exactly what I'd like to know. The direct cause of the error is that DROP TABLE is finding that some other backend has a reference-count hold on a page of the temp table it's trying to drop. Since no other backend should be trying to touch this temp table, there's something pretty fishy here.
>
> Given that this is a parallel test, you may be looking at a low-probability timing-dependent failure. I'd say set up the machine and run repeat tests for an hour or three ... that's what I plan to do here.

Okay, I roll'd an RC1 but haven't put it up for FTP yet ... I'll wait for a few hours to see if anyone can reproduce this, and, if not, put out what I've rolled ... say, 00:00AST ...
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
"Mikheev, Vadim" [EMAIL PROTECTED] writes:
> Hmm, concurrent checkpoint? Probably we could simplify dirty test in BufferSync() - ie test bufHdr->cntxDirty without holding shlock (and pin!) on buffer: should be good as long as we set cntxDirty flag *before* XLogInsert in access methods. Have to look more...

Yes, I'm wondering if some other backend is trying to write/flush the buffer (maybe as part of a checkpoint, maybe not). But seems like we should have seen this before, if so; that's not a low-probability scenario, particularly with just 64 buffers...

regards, tom lane
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
The Hermit Hacker [EMAIL PROTECTED] writes:
> Okay, I roll'd an RC1 but haven't put it up for FTP yet ... I'll wait for a few hours to see if anyone can reproduce this, and, if not, put out what I've rolled ...

This will not be RC1 :-(

I've been running one backend doing repeated iterations of

CREATE TABLE temptest(col int);
INSERT INTO temptest VALUES (1);
CREATE TEMP TABLE temptest(col int);
INSERT INTO temptest VALUES (2);
SELECT * FROM temptest;
DROP TABLE temptest;
SELECT * FROM temptest;
DROP TABLE temptest;

and another one doing repeated CHECKPOINTs. I've already gotten a couple occurrences of Lamar's failure.

I think the problem is that BufferSync unconditionally does PinBuffer on each buffer, and holds the pin during intervals where it's released BufMgrLock, even if there's not really anything for it to do on that buffer. If someone else is running FlushRelationBuffers then it's possible for that routine to see a nonzero pin count when it looks.

Vadim, what do you think about how to change this? I think this is BufferSync's fault not FlushRelationBuffers's ...

regards, tom lane
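The interleaving described in this message can be made concrete with a toy model. This is a sketch under stated assumptions, not the PostgreSQL source: the `Buffer` class, the generator-based `buffer_sync`, and the single `refcount` field are simplifications invented for illustration. Only the routine names and the -2 return value come from the messages above.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    rel: str           # relation the page belongs to
    dirty: bool = False
    refcount: int = 0  # stands in for the shared ("global") pin count


def buffer_sync(pool):
    """Toy BufferSync: pins every buffer, dirty or not.

    Written as a generator; each yield marks a point where the real
    routine is described as having released BufMgrLock while still
    holding the pin.
    """
    for buf in pool:
        buf.refcount += 1      # PinBuffer, unconditionally
        yield buf              # lock released here, pin still held
        if buf.dirty:
            buf.dirty = False  # pretend the page was written out
        buf.refcount -= 1      # unpin


def flush_relation_buffers(pool, rel):
    """Toy FlushRelationBuffers: -2 if any page of rel is pinned, else 0."""
    for buf in pool:
        if buf.rel == rel and buf.refcount > 0:
            return -2
    return 0


def drop_during_checkpoint(pool, rel):
    """Try the flush at every point where the toy checkpoint yields."""
    results = []
    for _ in buffer_sync(pool):
        results.append(flush_relation_buffers(pool, rel))
    results.append(flush_relation_buffers(pool, rel))  # checkpoint finished
    return results
```

With a pool of `[Buffer("pg_class"), Buffer("temptest")]`, `drop_during_checkpoint(pool, "temptest")` gives `[0, -2, 0]`. The flush fails only in the window where the checkpoint is holding its pin on the temp table's clean page, which fits a failure seen once in many runs.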
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
On Tue, 20 Mar 2001, Tom Lane wrote:
> This will not be RC1 :-(
>
> I've already gotten a couple occurrences of Lamar's failure.

Well, I was at least hoping it was a problem here -- particularly since I haven't been able to reproduce it. But, since it is not a local problem, I'm glad I caught it -- on the first regression test run, no less. I've run a dozen tests since without duplication.

Although, like you, Tom, I'm curious as to why it hadn't showed up before -- is the fact that this is a slow machine a factor, possibly?

Although I am now much more leery of our regression suite -- this issue isn't even tested, in reality. Do we have _any_ WAL-related tests? The parallel testing is a good thing -- but I wonder what boundary conditions aren't getting tested.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
Lamar Owen [EMAIL PROTECTED] writes:
> Although I am now much more leery of our regression suite

The regression tests are not at all designed to test concurrent behavior, and never have been. The parallel form runs some tests in parallel, true, but those tests are deliberately designed not to interact. So I don't put any faith in the regression tests as a means to catch bugs like this. We need some thought and work on better concurrent tests...

regards, tom lane
RE: [HACKERS] Beta 6 Regression results on RedHat 7.0.
> I think the problem is that BufferSync unconditionally does PinBuffer on each buffer, and holds the pin during intervals where it's released BufMgrLock, even if there's not really anything for it to do on that buffer. If someone else is running FlushRelationBuffers then it's possible for that routine to see a nonzero pin count when it looks.
>
> Vadim, what do you think about how to change this? I think this is BufferSync's fault not FlushRelationBuffers's ...

I'm looking there right now...

Vadim
Re: [HACKERS] Beta 6 Regression results on RedHat 7.0.
> I think the problem is that BufferSync unconditionally does PinBuffer on each buffer, and holds the pin during intervals where it's released BufMgrLock, even if there's not really anything for it to do on that buffer. If someone else is running FlushRelationBuffers then it's possible for that routine to see a nonzero pin count when it looks.

Further note: this bug does not arise in 7.0.* because in that code, BufferSync will only pin buffers that have been dirtied in the current transaction. This cannot affect a concurrent FlushRelationBuffers, which should be holding exclusive lock on the table it's flushing.

Or can it? The above is safe enough for user tables, but on system tables we have a bad habit of releasing locks early. It seems possible that a VACUUM on a system table might see pins due to BufferSyncs running in concurrent transactions that have altered that system table. Perhaps this issue does explain some of the reports of FlushRelationBuffers failure that we've seen from the field.

regards, tom lane
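The 7.0.* behaviour described in this note can be sketched the same way. Again this is a toy model with invented names (`Buffer`, `dirtied_by`, `pinned_by_sync`, `flush_sees_pin`, backend ids "A" and "B"), not the real buffer manager. It illustrates only the stated rule that BufferSync pins just the buffers dirtied in the syncing backend's own transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    rel: str
    # Backends that dirtied this page in their current transaction
    # (hypothetical bookkeeping, for illustration only).
    dirtied_by: set = field(default_factory=set)
    refcount: int = 0


def pinned_by_sync(pool, backend):
    """Buffers a 7.0-style BufferSync run by `backend` would pin."""
    return [buf for buf in pool if backend in buf.dirtied_by]


def flush_sees_pin(pool, rel, syncing_backend):
    """True if flushing `rel` would meet a pin held by that BufferSync."""
    pinned = pinned_by_sync(pool, syncing_backend)
    for buf in pinned:
        buf.refcount += 1  # pins held while the sync is in progress
    try:
        return any(b.rel == rel and b.refcount > 0 for b in pool)
    finally:
        for buf in pinned:
            buf.refcount -= 1  # sync done, pins released
```

If backend "A" alone dirtied the `temptest` page, a BufferSync in backend "B" never pins it, so `flush_sees_pin(pool, "temptest", "B")` is false. A system-table page dirtied by both "A" and "B" can still be pinned by B's sync, which is the VACUUM case raised in the second paragraph.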
RE: [HACKERS] Beta 6 Regression results on RedHat 7.0.
> Further note: this bug does not arise in 7.0.* because in that code, BufferSync will only pin buffers that have been dirtied in the current transaction. This cannot affect a concurrent FlushRelationBuffers, which should be holding exclusive lock on the table it's flushing.
>
> Or can it? The above is safe enough for user tables, but on system tables we have a bad habit of releasing locks early. It seems possible that a VACUUM on a system table might see pins due to BufferSyncs running in concurrent transactions that have altered that system table. Perhaps this issue does explain some of the reports of FlushRelationBuffers failure that we've seen from the field.

Another possible source of this problem (in 7.0.X) is BufferReplace..?

Vadim