Re: Simple DELETE on modest-size table runs 100% CPU forever

2019-11-16 Thread Craig James
Problem solved ... see below. Thanks everyone for your suggestions and
insights!

On Sat, Nov 16, 2019 at 7:16 AM Jeff Janes wrote:

> On Fri, Nov 15, 2019 at 7:27 PM Craig James wrote:
>
>> On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes wrote:
>> BTW, I'll note at this point that "analyze category_staging_8" prior to
>> this query made no difference.
>>
>
> Isn't that the wrong table to have analyzed? The offender here is
> "categories", not  "category_staging_8".  Is this some sort of inheritance
> situation?
>
>
>>
>>> What do you see in `select * from pg_stats where tablename='categories'
>>> and attname='category_id' \x\g\x`?
>>>
>>
>> db=> select * from pg_stats where tablename='categories' and
>> attname='category_id' \x\g\x;
>> Expanded display is on.
>>
>
>
>> ...
>>
>> n_distinct | 21
>> most_common_vals   |
>> {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
>> most_common_freqs  |
>> {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.0023,0.00037,0.0001}
>>
>
> There is a path in the analyze code where, if the least-seen value in the
> sample was seen more than once (i.e. no value was seen exactly once), then
> it assumes that the seen values are all the values that exist.  I think the
> logic behind that is dubious, but it is pretty clear that it is what is
> kicking in here.  But why?  I think the simple answer is that you analyzed
> the wrong table: the statistics shown here might have been accurate at some
> point in the past, but they no longer are.  It is hard to see how a value
> present 5000 times in a table of 274602 rows could have evaded sampling if
> it was present at the time the sample was taken.
>

As I mentioned in a reply to Andreas, I also added an "analyze ..." to the
other two tables as an experiment. It made no difference. However ...

Your comment about the missing 5000 values solved the problem: those values
had only just been inserted by the previous SQL statement, inside a
transaction. The code reconciles two collections across two different
servers: first it inserts all new values, then it deletes obsolete values.
So the "select ..." in question includes the very 5000 rows that were just
inserted.

I added an "analyze" between the insert and the delete. Instant fix.
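For the archives, here's a minimal sketch of the pattern that fixed it. Only
the table names and the added "analyze" come from this thread; the column
list and the exact insert/delete shape are illustrative, not our real schema:

    begin;

    -- Step 1: insert values that arrived from the other server but
    -- aren't here yet (columns and join condition are hypothetical).
    insert into categories (category_id, name)
    select s.category_id, s.name
    from category_staging_8 s
    where not exists (
        select 1 from categories c where c.category_id = s.category_id
    );

    -- Step 2 (the fix): refresh statistics so the planner sees the rows
    -- just inserted before it plans the delete. Unlike vacuum, analyze
    -- is allowed inside a transaction block.
    analyze categories;

    -- Step 3: delete obsolete values. Illustrative shape only; the real
    -- delete is restricted to the collection being reconciled.
    delete from categories c
    where not exists (
        select 1 from category_staging_8 s
        where s.category_id = c.category_id
    );

    commit;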

It also solves one other mystery: this query only caused problems on the
small test system, and has been working well on a production database with
about 100x more data. In production, each "category" is already populated
with a significant amount of data, so the production system already has
good statistics, and this one insert/delete doesn't change them appreciably.
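For anyone who hits something similar: the standard statistics views make it
easy to check whether a table has been analyzed recently and how many rows
have changed since. A generic check, nothing specific to our schema:

    select relname,
           last_analyze,
           last_autoanalyze,
           n_live_tup,
           n_mod_since_analyze
    from pg_stat_user_tables
    where relname = 'categories';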


> Cheers,
>
> Jeff
>

Thanks!
Craig


Re: Simple DELETE on modest-size table runs 100% CPU forever

2019-11-16 Thread Jeff Janes
On Fri, Nov 15, 2019 at 7:27 PM Craig James wrote:

> On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes wrote:
> BTW, I'll note at this point that "analyze category_staging_8" prior to
> this query made no difference.
>

Isn't that the wrong table to have analyzed? The offender here is
"categories", not  "category_staging_8".  Is this some sort of inheritance
situation?


>
>> What do you see in `select * from pg_stats where tablename='categories'
>> and attname='category_id' \x\g\x`?
>>
>
> db=> select * from pg_stats where tablename='categories' and
> attname='category_id' \x\g\x;
> Expanded display is on.
>


> ...
>
> n_distinct | 21
> most_common_vals   |
> {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
> most_common_freqs  |
> {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.0023,0.00037,0.0001}
>

There is a path in the analyze code where, if the least-seen value in the
sample was seen more than once (i.e. no value was seen exactly once), then
it assumes that the seen values are all the values that exist.  I think the
logic behind that is dubious, but it is pretty clear that it is what is
kicking in here.  But why?  I think the simple answer is that you analyzed
the wrong table: the statistics shown here might have been accurate at some
point in the past, but they no longer are.  It is hard to see how a value
present 5000 times in a table of 274602 rows could have evaded sampling if
it was present at the time the sample was taken.

Cheers,

Jeff