Re: [HACKERS] Database corruption in 7.0.3

2001-03-15 Thread Denis Perchine

Can confirm this. Get this just yesterday time ago...

Messages:

NOTICE:  Rel acm: TID 1697/217: OID IS INVALID. TUPGONE 1.

And lots of such lines...
And

pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally
before or while processing the request.


In the end :-((( I lost a library of our institute... :-((( But I have a 
backup!!! :- This table even have NO indices!!!

Program received signal SIGSEGV, Segmentation fault.
0x813837f in PageRepairFragmentation (page=0x82840b0 "") at bufpage.c:311
311 alignedSize = MAXALIGN((*lp).lp_len);
(gdb) bt
#0  0x813837f in PageRepairFragmentation (page=0x82840b0 "") at bufpage.c:311
#1  0x80a9b07 in vc_scanheap (vacrelstats=0x82675b0, onerel=0x8273428, 
vacuum_pages=0xbfffe928, fraged_pages=0xbfffe918) at vacuum.c:1022
#2  0x80a8e8b in vc_vacone (relid=27296, analyze=0 '\000', va_cols=0x0) at 
vacuum.c:599
#3  0x80a8217 in vc_vacuum (VacRelP=0xbfffe9b4, analyze=0 '\000', 
va_cols=0x0) at vacuum.c:299
#4  0x80a818b in vacuum (vacrel=0x8267400 "", verbose=1 '\001', analyze=0 
'\000', va_spec=0x0) at vacuum.c:223
#5  0x813fba5 in ProcessUtility (parsetree=0x8267418, dest=Remote) at 
utility.c:694
#6  0x813c16e in pg_exec_query_dest (query_string=0x820aaa0 "vacuum verbose 
acm;", dest=Remote, aclOverride=0 '\000') at postgres.c:617
#7  0x813c08e in pg_exec_query (query_string=0x820aaa0 "vacuum verbose acm;") 
at postgres.c:562
#8  0x813d4c3 in PostgresMain (argc=9, argv=0xb068, real_argc=9, 
real_argv=0xba3c) at postgres.c:1588
#9  0x811ace5 in DoBackend (port=0x8223068) at postmaster.c:2009
#10 0x811a639 in BackendStartup (port=0x8223068) at postmaster.c:1776
#11 0x811932f in ServerLoop () at postmaster.c:1037
#12 0x8118b0e in PostmasterMain (argc=9, argv=0xba3c) at postmaster.c:725
#13 0x80d5e5e in main (argc=9, argv=0xba3c) at main.c:93
#14 0x40111fee in __libc_start_main () from /lib/libc.so.6

This is plain 7.0.3.

On Thursday 15 March 2001 14:52, Tim Allen wrote:
 We have an application that we were running quite happily using pg6.5.3
 in various customer sites. Now we are about to roll out a new version of
 our application, and we are going to use pg7.0.3. However, in testing
 we've come across a couple of isolated incidents of database
 corruption. They are sufficiently rare that I can't reproduce the problem,
 nor can I put my finger on just what application behaviour causes the
 problems.

 The symptoms most often involve some sort of index corruption, which is
 reported by vacuum and it seems that vacuum can fix it. On occasion vacuum
 reports "invalid OID" or similar (sorry, don't have exact wording of
 message). On one occasion the database has been corrupted to the point of
 unusability (ie vacuum admitted that it couldn't fix the problem), and a
 dump/restore was required (thankfully that at least worked). The index
 corruption also occasionally manifests itself in the form of spurious
 uniqueness constraint violation errors.

 The previous version of our app using 6.5.3 has never shown the slightest
 symptom of database misbehaviour, to the best of my knowledge, despite
 fairly extensive use. So our expectations are fairly high :-).

 One thing that is different about the new version of our app is that we
 now use multiple connections to the database (previously we only had
 one). We can in practice have transactions in progress on several
 connections at once, and it is possible for some transactions to be rolled
 back under application control (ie explicit ROLLBACK; statement).

 I realise I haven't really provided an awful lot of information that would
 help identify the problem, so I shall attempt to be understanding if
 no-one can offer any useful suggestions. But I hope someone can :-). Has
 anyone seen this sort of problem before? Are there any known
 database-corrupting bugs in 7.0.3? I don't recall anyone mentioning any in
 the mailing lists. Is using multiple connections likely to stimulate any
 known areas of risk?

 BTW we are using plain vanilla SQL, no triggers, no new types defined, no
 functions, no referential integrity checks, nothing more ambitious than a
 multi-column primary key.

 The platform is x86 Red Hat Linux 6.2. Curiously enough, on one of our
 testing boxes and on my development box we have never seen this, but we
 have seen it several times on our other test box and at least one customer
 site, so there is some possibility it's related to dodgy hardware. The
 customer box with the problem is a multi-processor box, all the other
 boxes we've tested on are single-processor.

 TIA for any help,

 Tim

-- 
Sincerely Yours,
Denis Perchine

--
E-Mail: [EMAIL PROTECTED]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
--

---(end of broadcast)---
TIP 6: Have you searched our list 

Re: [HACKERS] Database corruption in 7.0.3

2001-03-15 Thread Tom Lane

Tim Allen [EMAIL PROTECTED] writes:
 Are there any known database-corrupting bugs in 7.0.3?

None that aren't also in earlier releases, AFAIR, so your report is
fairly troubling.  However there's not enough here to venture a guess
about the source of the problem.

Do you see any backend crashes or other misbehavior before the VACUUM
error pops up, or is that the only symptom?

It would be a good idea to rebuild the system with assert checks on
(configure --enable-cassert), in hopes that some Assert a little closer
to the source of the problem will fire.  Also, if you can spare some
disk space for logging, running the postmaster with -d2 to log all
queries might provide useful historical context when the problem
reappears.

I would like to be able to study the corrupted table, as well.  Can you
see your way to either giving me access to your machine, or (if the
database isn't too large) sending me a tar dump of the whole $PGDATA
directory next time it happens?

Please contact me off-list so we can figure out how best to pursue this
problem.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster