On 2/19/17 5:27 AM, Robert Haas wrote:
(1) a multi-batch hash join, (2) a nested loop,
and (3) a merge join. (2) is easy to implement but will generate a
ton of random I/O if the table is not resident in RAM. (3) is most
suitable for very large tables but takes more work to code, and is
also li
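The merge-join strategy (3) above can be illustrated as a single sorted pass over both sides. This is a hypothetical Python model, not PostgreSQL code: each side is reduced to a list of (ctid, key) pairs, with ctids as plain tuples, and the merge detects index entries whose key no longer matches the heap.

```python
# Hypothetical model of strategy (3): verify an index against the heap by
# merge-joining both sides on ctid and comparing the indexed key values.
# Sorting by ctid lets one sequential pass find mismatches without a
# random heap probe per index entry.

def merge_check(index_entries, heap_tuples):
    """Return ctids whose index key does not match the heap key."""
    idx = sorted(index_entries)    # (ctid, key) pairs from an index scan
    heap = sorted(heap_tuples)     # (ctid, key) pairs from a seqscan
    mismatches = []
    i = j = 0
    while i < len(idx) and j < len(heap):
        if idx[i][0] == heap[j][0]:
            if idx[i][1] != heap[j][1]:
                mismatches.append(idx[i][0])
            i += 1
            j += 1
        elif idx[i][0] < heap[j][0]:
            i += 1                 # index entry with no matching heap tuple
        else:
            j += 1                 # heap tuple not present in the index
    return mismatches

# A stale index entry still carrying the old key for ctid (2, 1):
idx = [((1, 1), 'a'), ((2, 1), 'old')]
heap = [((1, 1), 'a'), ((2, 1), 'new')]
print(merge_check(idx, heap))  # [(2, 1)]
```

The same comparison could equally be driven by a hash table (strategy 1); the merge variant is the one that scales past RAM because both inputs stream in sorted order.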
On Sun, Feb 19, 2017 at 3:52 PM, Pavan Deolasee
wrote:
> This particular case of corruption results in a heap tuple getting indexed
> by a wrong key (or to be precise, indexed by its old value). So the only way
> to detect the corruption is to look at each index key and check if it
> matches with
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas wrote:
> On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote:
>
> > Ah, nah, scratch that. If any post-index-build pruning had occurred on a
> > page, the evidence would be gone --- the non-matching older tuples would
> > be removed and what remained wo
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote:
> I wrote:
>> However, you might be able to find it without so much random I/O.
>> I'm envisioning a seqscan over the table, in which you simply look for
>> HOT chains in which the indexed columns aren't all the same. When you
>> find one, you'd h
On Fri, Feb 17, 2017 at 9:31 AM, Tom Lane wrote:
> This seems like it'd be quite a different tool than amcheck, though.
> Also, it would only find broken-HOT-chain corruption, which might be
> a rare enough issue to not deserve a single-purpose tool.
FWIW, my ambition for amcheck is that it will
I wrote:
> However, you might be able to find it without so much random I/O.
> I'm envisioning a seqscan over the table, in which you simply look for
> HOT chains in which the indexed columns aren't all the same. When you
> find one, you'd have to do a pretty expensive index lookup to confirm
> wh
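The seqscan idea quoted above — flag HOT chains whose members disagree on an indexed column — can be modeled roughly as follows. The data layout and helper names are illustrative only, not PostgreSQL internals.

```python
# Hypothetical model of the seqscan check: walk each HOT chain and flag
# chains whose members disagree on an indexed column. A chain is a list
# of tuple versions; `indexed` extracts the indexed column(s).

def find_broken_hot_chains(chains, indexed):
    suspects = []
    for chain_id, members in chains.items():
        keys = {indexed(t) for t in members}
        if len(keys) > 1:
            # The indexed column changed within a HOT chain, so the single
            # index entry (pointing at the chain root) may carry a stale
            # key. A real checker would now do the expensive index lookup
            # to confirm which key the index actually holds.
            suspects.append(chain_id)
    return suspects

chains = {
    'chain-1': [{'id': 1, 'val': 'x'}, {'id': 1, 'val': 'x'}],
    'chain-2': [{'id': 2, 'val': 'old'}, {'id': 2, 'val': 'new'}],
}
print(find_broken_hot_chains(chains, lambda t: t['val']))  # ['chain-2']
```

As the follow-up messages note, this only finds chains where the pre-pruning evidence still exists on the page; once the older tuples are pruned away, the mismatch is undetectable by this route.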
Peter Geoghegan writes:
> The difference with a test that could detect this variety of
> corruption is that that would need to visit the heap, which tends to
> be much larger than any one index, or even all indexes. That would
> probably need to be random I/O, too. It might be possible to mostly
>
On Fri, Feb 17, 2017 at 8:23 AM, Keith Fiske wrote:
> It's not the load I'm worried about, it's the locks that are required at
> some point during the rebuild. Doing an index rebuild here and there isn't a
> big deal, but trying to do it for an entire heavily loaded, multi-terabyte
> database is h
Keith Fiske wrote:
> I can understand if it's simply not possible, but if it is, I think in
> any cases of data corruption, having some means to check for it to be sure
> you're in the clear would be useful.
Maybe it is possible. I just didn't try, since it didn't seem very
useful.
--
Álva
On Fri, Feb 17, 2017 at 11:12 AM, Alvaro Herrera
wrote:
> Keith Fiske wrote:
>
> > Was just curious if anyone was able to come up with any sort of method to
> > test whether an index was corrupted by this bug, other than just waiting
> > for bad query results? We've used concurrent index rebuildi
Keith Fiske wrote:
> Was just curious if anyone was able to come up with any sort of method to
> test whether an index was corrupted by this bug, other than just waiting
> for bad query results? We've used concurrent index rebuilding quite
> extensively over the years to remove bloat from busy sys
On Mon, Feb 6, 2017 at 10:17 PM, Amit Kapila
wrote:
> On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote:
> > Amit Kapila writes:
> >> Hmm. Consider that the first time relcache invalidation occurs while
> >> computing id_attrs, so now the retry logic will compute the correct
> >> set of attrs (co
On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote:
> Amit Kapila writes:
>> Hmm. Consider that the first time relcache invalidation occurs while
>> computing id_attrs, so now the retry logic will compute the correct
>> set of attrs (considering two indexes, if we take the example given by
>> Alvaro
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane wrote:
> After some discussion among the release team, we've concluded that the
> best thing to do is to push Pavan's/my patch into today's releases.
> This does not close the matter by any means: we should continue to
> study whether there are related bu
After some discussion among the release team, we've concluded that the
best thing to do is to push Pavan's/my patch into today's releases.
This does not close the matter by any means: we should continue to
study whether there are related bugs or whether there's a more principled
way of fixing this
On Sun, Feb 5, 2017 at 9:42 PM, Pavan Deolasee wrote:
> On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote:
>> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote:
>> > I don't think this kind of black-and-white thinking is very helpful.
>> > Obviously, data corruption is bad. However, this bu
Alvaro Herrera writes:
> Tom Lane wrote:
>> Better to fix the callers so that they don't have the assumption you
>> refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
>> so that it returns all the sets needed by a given calling module at
>> once, which would allow us to guar
Tom Lane wrote:
> Better to fix the callers so that they don't have the assumption you
> refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap
> so that it returns all the sets needed by a given calling module at
> once, which would allow us to guarantee they're consistent.
No
Amit Kapila writes:
> Hmm. Consider that the first time relcache invalidation occurs while
> computing id_attrs, so now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above.). However, the attrs computed for hot_* and key
Andres Freund wrote:
> To show what I mean here's an *unpolished* and *barely tested* patch
> implementing the first of my suggestions.
>
> Alvaro, Pavan, I think should address the issue as well?
Hmm, interesting idea. Maybe a patch along these lines is a good way to
make index-list cache less
Tom Lane wrote:
> Pavan Deolasee writes:
> > 2. In the second patch, we tried to recompute attribute lists if a relcache
> > flush happens in between and index list is invalidated. We've seen problems
> > with that, especially it getting into an infinite loop with
> > CACHE_CLOBBER_ALWAYS. Not c
On 05/02/17 at 21:57, Tomas Vondra wrote:
>
> +1 to not rushing fixes into releases. While I think we now finally
> understand the mechanics of this bug, the fact that we came up with
> three different fixes in this thread, only to discover issues with each
> of them, warrants some caution.
> On 6 Feb 2017, at 4:57, Peter Geoghegan wrote:
>
> I meant that I find the fact that there were no user reports in all
> these years to be a good reason to not proceed for now in this
> instance.
Well, we had some strange situations with indexes (see below for example) but I
couldn’t e
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee wrote:
>
>
> On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote:
>>
>>
>>
>> Hmm. Consider that the first time relcache invalidation occurs while
>> computing id_attrs, so now the retry logic will compute the correct
>> set of attrs (considering two i
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee
wrote:
>
>>
> I like this approach. I'll run the patch on a build with
> CACHE_CLOBBER_ALWAYS, but I'm pretty sure it will be ok.
>
While it looks certain that the fix will miss this point release, FWIW I
ran this patch with CACHE_CLOBBER_ALWAYS and
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote:
>
>
> Hmm. Consider that the first time relcache invalidation occurs while
> computing id_attrs, so now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above.).
I don't
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee wrote:
>
>
> On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote:
>>
>> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee
>> wrote:
>> >
>> >
>> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote:
>> >>
>> >>
>> >>
>> >> > 2. In the second patch, we tried
On 2017-02-05 22:34:34 -0500, Tom Lane wrote:
> Pavan Deolasee writes:
> The point is that there's a nontrivial chance of a hasty fix introducing
> worse problems than we fix.
>
> Given the lack of consensus about exactly how to fix this, I'm feeling
> like it's a good idea if whatever we come up
Pavan Deolasee writes:
> On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote:
>> +1. I don't think we serve our users by putting out a nontrivial bugfix
>> hastily. Nor do I think we serve them in this instance by delaying the
>> release to cover this fix; there's enough other fixes in the relea
On 2017-02-05 21:49:57 -0500, Tom Lane wrote:
> Andres Freund writes:
> > I've not yet read the full thread, but I'm a bit confused so far. We
> > obviously can get changing information about indexes here, but isn't
> > that something we have to deal with anyway? If we guarantee that we
> > don't
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote:
> On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> > Michael Paquier writes:
> > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule
> wrote:
> > >> I agree with Pavan - a release with known important bug is not good
> idea.
> >
> > > This bug has
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote:
> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee
> wrote:
> >
> >
> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote:
> >>
> >>
> >>
> >> > 2. In the second patch, we tried to recompute attribute lists if a
> >> > relcache
> >> > flush happens
Andres Freund writes:
> I've not yet read the full thread, but I'm a bit confused so far. We
> obviously can get changing information about indexes here, but isn't
> that something we have to deal with anyway? If we guarantee that we
> don't lose knowledge that there's a pending invalidation, wh
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote:
> I don't see in your patch that you are setting rd_bitmapsvalid to 0.
IIRC a plain relcache rebuild should do that (note there's also no place
that directly resets rd_indexattrs).
> Also, I think this suffers from the same problem as the patch pr
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee wrote:
> I'm not sure that just because the bug wasn't reported by a user, makes it
> any less critical. As Tomas pointed down thread, the nature of the bug is
> such that the users may not discover it very easily, but that doesn't mean
> it couldn't
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee wrote:
>
>
> On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote:
>>
>>
>>
>> > 2. In the second patch, we tried to recompute attribute lists if a
>> > relcache
>> > flush happens in between and index list is invalidated. We've seen
>> > problems
>> > wit
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote:
> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote:
> > I don't think this kind of black-and-white thinking is very helpful.
> > Obviously, data corruption is bad. However, this bug has (from what
> > one can tell from this thread) been wi
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund wrote:
> Hi,
>
> On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
>> > RelationGetIndexList(Relation relation)
>> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
>> > * we can include system attributes (e.g., OID) in the bitmap
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote:
>
>
> > 2. In the second patch, we tried to recompute attribute lists if a
> relcache
> > flush happens in between and index list is invalidated. We've seen
> problems
> > with that, especially it getting into an infinite loop with
> > CACHE_CLOBBER
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra
wrote:
> OTOH I disagree with the notion that bugs that are not driven by user
> reports are somehow less severe. Some data corruption bugs cause quite
> visible breakage - segfaults, immediate crashes, etc. Those are pretty clear
> bugs, and are report
Hi,
On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
> > RelationGetIndexList(Relation relation)
> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
> > * we can include system attributes (e.g., OID) in the bitmap
> > representation.
> > *
> > * Caller had better hold at
On 02/06/2017 01:11 AM, Peter Geoghegan wrote:
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote:
I don't think this kind of black-and-white thinking is very
helpful. Obviously, data corruption is bad. However, this bug has
(from what one can tell from this thread) been with us for over a
decad
On 2017-02-05 15:14:59 -0500, Tom Lane wrote:
> I do not like any of the other patches proposed in this thread, because
> they fail to guarantee delivering an up-to-date attribute bitmap to the
> caller. I think we need a retry loop, and I think that it needs to look
> like the attached.
That loo
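The retry loop being discussed can be sketched abstractly: recompute the attribute bitmaps, and if an invalidation arrived mid-computation, discard the result and start over, so the caller never sees bitmaps derived from a stale index list. This is a hypothetical Python model of the control flow, not PostgreSQL's actual relcache code.

```python
# Hypothetical sketch of the retry-loop idea: keep recomputing until a
# full pass completes with no concurrent invalidation, guaranteeing the
# returned bitmaps are consistent with the current index list.
import itertools

class RelCacheEntry:
    def __init__(self, index_list):
        self.index_list = index_list
        self.invalidated = False       # set by a concurrent invalidation

def get_attr_bitmaps(rel, compute):
    for _ in itertools.count():
        rel.invalidated = False
        bitmaps = compute(rel.index_list)   # may observe an invalidation
        if not rel.invalidated:
            return bitmaps                  # consistent with index_list
        # else: the index list changed under us; loop and recompute

rel = RelCacheEntry(index_list=['idx_a'])
calls = {'n': 0}

def compute(index_list):
    calls['n'] += 1
    if calls['n'] == 1:
        # simulate a relcache flush (e.g. a concurrent index build
        # finishing) arriving in the middle of the computation
        rel.index_list = ['idx_a', 'idx_b']
        rel.invalidated = True
    return frozenset(index_list)

result = get_attr_bitmaps(rel, compute)
print(sorted(result))  # ['idx_a', 'idx_b'] -- the retry saw both indexes
```

The CACHE_CLOBBER_ALWAYS concern raised elsewhere in the thread is visible in this shape: if every computation triggers a fresh invalidation, the loop never terminates.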
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> Michael Paquier writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule
> > wrote:
> >> I agree with Pavan - a release with known important bug is not good idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the t
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote:
> I don't think this kind of black-and-white thinking is very helpful.
> Obviously, data corruption is bad. However, this bug has (from what
> one can tell from this thread) been with us for over a decade; it must
> necessarily be either low-prob
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués wrote:
> On 05/02/17 at 10:03, Michael Paquier wrote:
>> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule
>> wrote:
>>> I agree with Pavan - a release with known important bug is not good idea.
>>
>> This bug has been around for some time, so I w
[ Having now read the whole thread, I'm prepared to weigh in ... ]
Pavan Deolasee writes:
> This seems like a real problem to me. I don't know what the consequences
> are, but having the various attribute lists hold different views of the
> set of indexes definitely doesn't seem right.
For sure.
On 05/02/17 at 10:03, Michael Paquier wrote:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
>> I agree with Pavan - a release with known important bug is not good idea.
>
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix
2017-02-05 18:51 GMT+01:00 Tom Lane :
> Michael Paquier writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule
> wrote:
> >> I agree with Pavan - a release with known important bug is not good
> idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the time neces
Michael Paquier writes:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
>> I agree with Pavan - a release with known important bug is not good idea.
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
>
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
> I agree with Pavan - a release with known important bug is not good idea.
This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of min
2017-02-05 7:54 GMT+01:00 Pavan Deolasee :
>
> On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote:
>
>>
>> Based on Pavan's comments, I think trying to force this into next week's
>> releases would be extremely unwise. If the bug went undetected this long,
>> it can wait for a fix for another three
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote:
>
> Based on Pavan's comments, I think trying to force this into next week's
> releases would be extremely unwise. If the bug went undetected this long,
> it can wait for a fix for another three months.
Yes, I think bug existed ever since and we
Alvaro Herrera writes:
> I intend to commit this soon to all branches, to ensure it gets into the
> next set of minors.
Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise. If the bug went undetected this long,
it can wait for a fix for ano
On Sat, Feb 4, 2017 at 12:10 PM, Amit Kapila
wrote:
>
>
> If we do above, then I think primary key attrs won't be returned
> because for those we are using relation copy rather than an original
> working copy of attrs. See code below:
>
> switch (attrKind)
> {
> ..
> case INDEX_ATTR_BITMAP_PRIMAR
On Sat, Feb 4, 2017 at 12:12 AM, Alvaro Herrera
wrote:
> Pavan Deolasee wrote:
>
>> Looking at the history and some past discussions, it seems Tomas reported
>> somewhat similar problem and Andres proposed a patch here
>> https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel
Pavan Deolasee wrote:
> Looking at the history and some past discussions, it seems Tomas reported
> somewhat similar problem and Andres proposed a patch here
> https://www.postgresql.org/message-id/20140514155204.ge23...@awork2.anarazel.de
> which got committed via b23b0f5588d2e2. Not exactly the
On Thu, Feb 2, 2017 at 10:14 PM, Alvaro Herrera
wrote:
>
>
> I'm going to study the bug a bit more, and put in some patch before the
> upcoming minor tag on Monday.
>
>
Looking at the history and some past discussions, it seems Tomas reported
somewhat similar problem and Andres proposed a patch h
On Thu, Feb 2, 2017 at 6:14 PM, Amit Kapila wrote:
>
> /*
> + * If the index list was invalidated, we better also invalidate the index
> + * attribute list (which should automatically invalidate other attributes
> + * such as primary key and replica identity)
> + */
>
> + relation->rd_indexattr
Pavan Deolasee wrote:
> I can reproduce this entire scenario using gdb sessions. This also explains
> why the patch I sent earlier helps to solve the problem.
Ouch. Great detective work there.
I think it's quite possible that this bug explains some index errors,
such as primary keys (or unique
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee
wrote:
>
> Based on my investigation so far and the evidence that I've collected,
> what probably happens is that after a relcache invalidation arrives at the
> new transaction, it recomputes the rd_indexattr but based on the old,
> cached rd_indexl
On Mon, Jan 30, 2017 at 7:20 PM, Pavan Deolasee
wrote:
> Hello All,
>
> While stress testing WARM, I encountered a data consistency problem. It
> happens when an index is built concurrently. What I thought to be a
> WARM-induced bug took me a significant amount of investigation to finally
> conc
Hello All,
While stress testing WARM, I encountered a data consistency problem. It
happens when an index is built concurrently. What I thought to be a
WARM-induced bug took me a significant amount of investigation to finally
conclude that it's a bug in the master. In fact, we tested all the way
For the record, here are the results of our (ongoing) investigation into
the index/heap corruption problems I reported a couple of weeks ago.
We were able to trigger the problem with kernels 2.6.16, 2.6.17 and
2.6.18.rc1, with 2.6.16 seeming to be the most flaky.
By replacing the NFS-mounted neta
Marc Munro <[EMAIL PROTECTED]> writes:
> We tried all of these suggestions and still get the problem. Nothing
> interesting in the log file so I guess the Asserts did not fire.
Not surprising, it was a long shot that any of those things were really
broken. But worth testing.
> We are going to t
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> A couple other things to try, given that you can provoke the failure
> fairly easily:
> . . .
> 1. In studying the cod
On Fri, 2006-06-30 at 12:05, Jan Wieck wrote:
> On 6/30/2006 11:55 AM, Tom Lane wrote:
>
> > Jan Wieck <[EMAIL PROTECTED]> writes:
> >> On 6/30/2006 11:17 AM, Marko Kreen wrote:
> >>> If the xxid-s come from different DB-s, then there can still be problems.
> >
> >> How so? The
Jan Wieck <[EMAIL PROTECTED]> writes:
> You're right ... forgot about that one.
> However, transactions from different origins are NEVER selected together
> and it wouldn't make sense to compare their xid's anyway. So the index
> might return index tuples for rows from another origin, but the
>
Tom Lane wrote:
> Brad Nicholson <[EMAIL PROTECTED]> writes:
>> It may or may not be the same issue, but for what it's worth, we've seen
>> the same sl_log_1 corruption on AIX 5.1 and 5.3
>
> Hm, on what filesystem, and what PG version(s)?
>
> I'm not completely satisfied by the its-a-kernel-bu
Brad Nicholson <[EMAIL PROTECTED]> writes:
> It may or may not be the same issue, but for what it's worth, we've seen
> the same sl_log_1 corruption on AIX 5.1 and 5.3
Hm, on what filesystem, and what PG version(s)?
I'm not completely satisfied by the its-a-kernel-bug theory, because if
it were
Brad Nicholson wrote:
> Tom Lane wrote:
>> Marc Munro <[EMAIL PROTECTED]> writes:
>>> I'll get back to you with kernel build information tomorrow. We'll also
>>> try to talk to some kernel hackers about this.
>> Some googling turned up recent discussions about race conditions in
>> Linux NFS code:
On 6/30/2006 11:55 AM, Tom Lane wrote:
Jan Wieck <[EMAIL PROTECTED]> writes:
On 6/30/2006 11:17 AM, Marko Kreen wrote:
If the xxid-s come from different DB-s, then there can still be problems.
How so? They are always part of a multi-key index having the
originating node ID first.
Really?
Tom Lane wrote:
> Marc Munro <[EMAIL PROTECTED]> writes:
>> I'll get back to you with kernel build information tomorrow. We'll also
>> try to talk to some kernel hackers about this.
>
> Some googling turned up recent discussions about race conditions in
> Linux NFS code:
>
> http://threebit.net/
Jan Wieck <[EMAIL PROTECTED]> writes:
> On 6/30/2006 11:17 AM, Marko Kreen wrote:
>> If the xxid-s come from different DB-s, then there can still be problems.
> How so? They are always part of a multi-key index having the
> originating node ID first.
Really?
create table @[EMAIL PROTECTED] (
On 6/30/2006 11:17 AM, Marko Kreen wrote:
On 6/30/06, Jan Wieck <[EMAIL PROTECTED]> wrote:
With the final implementation of log switching, the problem of xxid
wraparound will be avoided entirely. Every now and then slony will
switch from one to another log table and when the old one is drained
On 6/30/06, Jan Wieck <[EMAIL PROTECTED]> wrote:
With the final implementation of log switching, the problem of xxid
wraparound will be avoided entirely. Every now and then slony will
switch from one to another log table and when the old one is drained and
logically empty, it is truncated, which
On 6/30/2006 9:55 AM, Tom Lane wrote:
"Marko Kreen" <[EMAIL PROTECTED]> writes:
The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable. When having high enough activity on
one node or having nodes with XIDs on different enough positions
funny things ha
I trawled through the first, larger dump you sent me, and found multiple
index entries pointing to quite a few heap tuples:
Occurrences block item
2 43961 1
2 43961 2
2 43961 3
2 43961 4
"Marko Kreen" <[EMAIL PROTECTED]> writes:
> The sl_log_* tables are indexed on xid, where the relations between
> values are not exactly stable. When having high enough activity on
> one node or having nodes with XIDs on different enough positions
> funny things happen.
Yeah, that was one of the
On 6/30/06, Tom Lane <[EMAIL PROTECTED]> wrote:
I don't know the kernel nearly well enough to guess if these are related
...
The sl_log_* tables are indexed on xid, where the relations between
values are not exactly stable. When having high enough activity on
one node or having nodes with XIDs
Marc Munro <[EMAIL PROTECTED]> writes:
> I'll get back to you with kernel build information tomorrow. We'll also
> try to talk to some kernel hackers about this.
Some googling turned up recent discussions about race conditions in
Linux NFS code:
http://threebit.net/mail-archive/linux-kernel/msg0
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
>
> BTW, a couple of thoughts here:
>
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data
Marc Munro <[EMAIL PROTECTED]> writes:
> By dike out, you mean remove? Please confirm and I'll try it.
Right, just remove (or comment out) the lines I quoted.
> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case. I don't have a lot of hope for that, but it's
> something that would require only machine time not peopl
[ back to the start of the thread... ]
Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_
Marc Munro <[EMAIL PROTECTED]> writes:
> If there's anything we can do to help debug this we will. We can apply
> patches, different build options, etc.
One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the
failure case. I don'
On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:
> Ummm ... you did restart the server? "select version();" would be
> the definitive test.
Can't say I blame you for the skepticism but I have confirmed it again.
test=# select version();
version
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row. We fixed one bug like this before ...
Further stu
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables ?
>
We have a non-slony trigger on only 2 tables, neither of them involved
in this transaction. We certainly have no circular trigger structures.
> I remember having some corruption in
On Thu, 2006-06-29 at 17:23, Tom Lane wrote:
> Marc Munro <[EMAIL PROTECTED]> writes:
> > Tom,
> > we have a newer and much smaller (35M) file showing the same thing:
>
> Thanks. Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap t
On Thu, 2006-06-29 at 16:42, Chris Browne wrote:
> [EMAIL PROTECTED] (Marc Munro) writes:
> > As you see, slony is attempting to enter one tuple
> > ('374520943','22007','0') two times.
> >
> > Each previous time we have had this problem, rebuilding the indexes on
> > slony log t
Marc Munro <[EMAIL PROTECTED]> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:
Thanks. Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:
idx1:
Item 190 -- Length: 24 Offset: 3616 (0x0e20) Flags: USED
[EMAIL PROTECTED] (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem. I have not reindexed
> the table this ti
We have reproduced the problem again. This time it looks like vacuum is
not part of the picture. From the provider's log:
2006-06-29 14:02:41 CST DEBUG2 cleanupThread: 101.057 seconds for vacuuming
And from the subscriber's:
2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into
"
Marc Munro <[EMAIL PROTECTED]> writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem. I have not reindexed
> the table this time
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
> OK, so it's not an already-known problem.
>
> > We were able to corrupt the index within 90 minutes of starting up our
> > cluster. A slony-induced vacuum was under way on the provider at the
> > time the subscriber failed.
>
> You still haven
Marc Munro <[EMAIL PROTECTED]> writes:
> On Tom Lane's advice, we upgraded to Postgres 8.0.8.
OK, so it's not an already-known problem.
> We were able to corrupt the index within 90 minutes of starting up our
> cluster. A slony-induced vacuum was under way on the provider at the
> time the subsc
On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded
slony to 1.1.5, due to some rpm issues. Apart from that everything is
as described below.
We were able to corrupt the index within 90 minutes of starting up our
cluster. A slony-induced vacuum was under way on the provider at
Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the proble
We have now experienced index corruption on two separate but identical
slony clusters. In each case the slony subscriber failed after
attempting to insert a duplicate record. In each case reindexing the
sl_log_1 table on the provider fixed the problem.
The latest occurrence was on our production