Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

2023-11-29 Thread Chris Travers
On Thu, Nov 30, 2023 at 9:03 AM Abdul Qoyyuum 
wrote:

> Hi Chris,
>
> On Wed, Nov 29, 2023 at 7:38 PM Chris Travers 
> wrote:
>
>>
>>
>> On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum 
>> wrote:
>>
>>> Hi all,
>>>
>>> Knowing that it's a data corruption issue, the only way to fix this is
>>> to vacuum and reindex the database. What was suggested was the following:
>>>
>>> SET zero_damaged_pages = 0; # This is so that the application can continue
>>> to run
>>> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem
>>> if possible.
>>> REINDEX DATABASE "core"; # Then do a reindex and clean it up.
>>>
>>
>> So first, to clear up some confusion on my part here:
>>
>> This procedure doesn't make a lot of sense to me.  But did it clear up
>> the issue?
>>
> Yeah, it did fix the issue before (the same issue as last year), and it has
> fixed the problem that happened again a few days ago (almost exactly a year
> after the previous occurrence).
>
>>
>> In any of these cases, it is extremely important to diagnose the system
>> properly.  If you have a fault in your storage device or RAID controller,
>> for example, you are asking for more corruption and data loss later.
>>
>
>> At first I thought maybe you mistyped something and then realized there
>> were a few issues with the process so it actually didn't make sense.
>>
>> First, zero_damaged_pages defaults to 0, and I can think of no reason to
>> set it explicitly.
>> Secondly, a VACUUM FULL rebuilds the indexes anyway, so there is no reason
>> to reindex afterwards.  Your whole procedure amounts to a VACUUM FULL, and
>> the index rebuild is the only part that could have affected this; if it did
>> work, reindexing is the part that helped.
>>
> Oh that makes sense actually. Thanks.
>

So for a temporary workaround, it sounds like reindexing helps for now, but
yeah this really needs deeper investigation.
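
One relatively cheap next step: contrib's amcheck (shipped with PostgreSQL 12)
can verify the btree indexes without rewriting anything, so you can tell
whether the damage keeps turning up in indexes only.  A minimal sketch, run
per database during a quiet period (as a superuser, or after granting execute
on the amcheck functions):

CREATE EXTENSION IF NOT EXISTS amcheck;

-- Verify every valid, non-temporary btree index in the current database.
-- heapallindexed => true also cross-checks the heap against the index;
-- it is slower, so pass false instead for a quicker structural check.
SELECT c.relname AS index_name,
       bt_index_check(c.oid, true)
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
JOIN pg_am am   ON am.oid = c.relam
WHERE am.amname = 'btree'
  AND c.relkind = 'i'
  AND i.indisvalid
  AND c.relpersistence <> 't';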

>
>> On to the question of what to do next
>>
>>>
>>> We're on PostgreSQL 12. This worked when the problem happened before (almost
>>> exactly a year ago), but I think this needs a more permanent solution. I've
>>> looked at routine vacuuming and checked that autovacuum is set to on, with
>>> the following configuration:
>>>
>>> core=> select name, setting from pg_settings where name like
>>> 'autovacuum%';
>>>                 name                 |  setting
>>> -------------------------------------+----------
>>>  autovacuum                          | on
>>>  autovacuum_analyze_scale_factor     | 0.1
>>>  autovacuum_analyze_threshold        | 50
>>>  autovacuum_freeze_max_age           | 2
>>>  autovacuum_max_workers              | 3
>>>  autovacuum_multixact_freeze_max_age | 4
>>>  autovacuum_naptime                  | 60
>>>  autovacuum_vacuum_cost_delay        | 2
>>>  autovacuum_vacuum_cost_limit        | -1
>>>  autovacuum_vacuum_scale_factor      | 0.2
>>>  autovacuum_vacuum_threshold         | 50
>>>  autovacuum_work_mem                 | -1
>>> (12 rows)
>>>
>>> Can anyone advise if there's anything else we can do? We have no clue what
>>> causes the invalid page block. We are running a High Availability cluster
>>> setup, but we are hoping that there may be a way to mitigate it.
>>>
>>>
>> You need to figure out why the corruption is happening.  This is most
>> likely, in my experience, not a PostgreSQL bug, but usually something that
>> happens on the hardware layer or an environmental factor.  It could be
>> failing storage or CPU.  Or it could be something like bad electrical input
>> or insufficient cooling (I have seen index and even table corruption issues
>> from both of these).
>>
>> If this is a server you run, the first things I would check are:
>> 1.  Is there a good-quality UPS that the server is plugged into?  Are the
>> batteries in good working order?
>>
> The servers are dual-powered: they are hooked up to the supplied mains
> electricity, backed by a generator, and if that fails they switch over to
> the UPS. All of this is supplied and maintained by the data centre where
> the servers are hosted. There have been no electrical problems so far.
>
>> 2.  Is the server somewhere that may be sitting in a pocket of hot air?
>>
> As you can imagine, the data centre has air-conditioning and floor fans
> blowing hot air up and out, keeping all the servers cooled. Checking the
> blade servers that the database VM sits on shows the temperature sitting
> optimally at 65 degrees Celsius. So I don't think it's a temperature
> problem either.
>

Ok so this is in a professionally run datacenter.  That does indeed
eliminate at least some of these issues.

>
>> Once you have ruled these out, the next things to check are CPU, memory,
>> and storage health.  Unfortunately checking these is harder but you can
>> check SMART indications, and other diagnostic indicators.
>>
> Would these top stats be useful? The database does get busy, especially at
> peak hours.
>
> top - 10:04:25 up 782 days, 43 min,  2 users,  load average: 1.20, 1.14,
> 1.10
> 

Re: Could not read from file "pg_subtrans/00F5" at offset 122880: Success.

2023-11-29 Thread Kyotaro Horiguchi
At Wed, 29 Nov 2023 18:29:15 +0100, Alvaro Herrera  
wrote in 
> The code in master is completely different (it uses pg_pread rather than
> seek + read): it does test for errno and reports accordingly.
> 
> So, nothing to do here.

Oops! Thank you and sorry for the noise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Removing oids with pg_repack

2023-11-29 Thread Achilleas Mantzios

On 27/11/23 16:51, CG wrote:



On Wednesday, November 22, 2023 at 12:38:54 PM EST, Achilleas Mantzios 
 wrote:



On 22/11/23 15:14, CG wrote:


On Wednesday, November 22, 2023 at 01:20:18 AM EST, Achilleas Mantzios 
 
 wrote:



On 21/11/23 20:41, CG wrote:
I have a very large PostgreSQL 9.5 database that still has very large 
tables with oids. I'm trying to get rid of the oids with as little 
downtime as possible so I can prep the database for upgrade past 
PostgreSQL 11. I had a wild idea to mod pg_repack to write a new table 
without oids. I think it almost works.
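
(For anyone doing the same pre-upgrade prep: a quick way to inventory which
tables still carry OIDs -- a minimal sketch, excluding the system schemas:)

-- Ordinary tables that still have the hidden oid column (PostgreSQL <= 11)
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relhasoids
  AND c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');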


To test out my idea I made a new table wipe_oid_test with oids. I 
filled it with a few rows of data.



But PostgreSQL still thinks that the table has oids:

mydata=# \d+ wipe_oid_test
                   Table "public.wipe_oid_test"
 Column | Type | Modifiers | Storage  | Stats target | Description
--------+------+-----------+----------+--------------+-------------
 k      | text | not null  | extended |              |
 v      | text |           | extended |              |
Indexes:
    "wipe_oid_test_pkey" PRIMARY KEY, btree (k)
Has OIDs: yes
But where do the pg_repack docs (or source) mention that it is meant to be
used for NO OIDS conversion?


It does not -- I was trying to leverage and tweak the base functionality of
pg_repack, which sets up triggers and migrates data. I figured that if the
target table was created without OIDs, then when pg_repack did the "swap"
operation the new table would take over, with the added bonus of not having
oids.


I can modify pg_class and set relhasoids = false, but it isn't actually
eliminating the oid column. `\d+` will no longer report that it has oids,
but the oid column is still present and returns the same result as before
updating pg_class.



Just Don't!

Noted. ;)

So I'm definitely missing something. I really need a pointer in the right
direction. Please help! ;)



There are a few methods to get rid of OIDs:

- ALTER TABLE .. SET WITHOUT OIDS (just mentioning, you already 
checked that)


This makes the database unusable for hours and hours and hours because 
it locks the table entirely while it performs the operation. That's 
just something that we can't afford.


- Use table copy + a trigger to log changes:
https://dba.stackexchange.com/questions/259359/eliminating-oids-while-upgrading-postgresql-from-9-4-to-12



That Stack Exchange post is not quite the effect I'm going for. Its poster
was using OIDs in their application and needed a solution to maintain those
values after conversion. I simply want to eliminate them without the
extraordinary downtime the database would experience during ALTER
operations.
Sorry, I meant this one: Stripping OIDs from tables in preparation for
pg_upgrade

This is the same idea as the Percona ETL strategy, and essentially 90% of
what pg_repack already does (it creates new tables, sets up triggers, locks
the tables, and swaps new for old at the end of the process).





- Use of Inheritance (the neatest solution I have seen; this is what I used
for a 2TB table conversion):
https://www.percona.com/blog/performing-etl-using-inheritance-in-postgresql/



This is closest to the effect I was going for. pg_repack essentially
creates a second table and fills it with the data from the first table,
while ensuring that standard db operations against that table continue to
function as the data is being moved from the old table to the new one. The
process outlined in the Percona ETL strategy has to be repeated per table,
which is work I was hoping to avoid by leveraging 95% of the functionality
of pg_repack and supplying my own 5%, so that the resulting table would not
have oids regardless of the source table's configuration.
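
For reference, the per-table version of that copy-and-swap, applied to the
wipe_oid_test example above, looks roughly like the sketch below (9.5 syntax,
no batching, dependent objects copied only via LIKE ... INCLUDING ALL,
views/FKs not handled).  It is not pg_repack's actual mechanism -- pg_repack's
change-log machinery is precisely the part that handles the concurrency corner
cases:

-- 1. Shadow table: CREATE TABLE is WITHOUT OIDS unless default_with_oids is on.
CREATE TABLE wipe_oid_test_new (LIKE wipe_oid_test INCLUDING ALL);

-- 2. Keep the shadow table in sync while the backfill runs.
CREATE FUNCTION wipe_oid_test_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        DELETE FROM wipe_oid_test_new WHERE k = OLD.k;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO wipe_oid_test_new VALUES (NEW.*);
    END IF;
    RETURN NULL;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER wipe_oid_test_sync AFTER INSERT OR UPDATE OR DELETE
    ON wipe_oid_test FOR EACH ROW EXECUTE PROCEDURE wipe_oid_test_sync();

-- 3. Backfill; ON CONFLICT skips rows the trigger has already copied.
INSERT INTO wipe_oid_test_new SELECT * FROM wipe_oid_test
ON CONFLICT (k) DO NOTHING;

-- 4. Swap under a short exclusive lock; the DELETE catches rows removed
--    from the original table while the backfill was running.
BEGIN;
LOCK TABLE wipe_oid_test IN ACCESS EXCLUSIVE MODE;
DELETE FROM wipe_oid_test_new n
WHERE NOT EXISTS (SELECT 1 FROM wipe_oid_test o WHERE o.k = n.k);
DROP TRIGGER wipe_oid_test_sync ON wipe_oid_test;
ALTER TABLE wipe_oid_test RENAME TO wipe_oid_test_old;
ALTER TABLE wipe_oid_test_new RENAME TO wipe_oid_test;
COMMIT;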


For my experiment, Table A did have oids. Table B (created by 
pg_repack) did not (at least at creation). When the "swap" operation 
happened in pg_repack, the metadata for Table A was assigned to Table 
B. I'm just trying to figure out what metadata I need to change in 
the system tables to reflect the actual table structure.


I have the fallback position for the Percona ETL strategy. But I feel 
like I'm REALLY close with pg_repack and I just don't understand 
enough about the system internals to nudge it to correctness and need 
some expert 

Re: Installing extension temporal_tables for PG13

2023-11-29 Thread H
On 11/28/2023 12:38 AM, Adrian Klaver wrote:
> On 11/27/23 18:18, H wrote:
>> On November 27, 2023 8:48:35 PM GMT-05:00, Adrian Klaver 
>>  wrote:
>>> On 11/27/23 17:41, H wrote:
 On 11/27/2023 08:38 PM, Adrian Klaver wrote:
> On 11/27/23 17:17, H wrote:
>
>>> I don't use PGXN so I am flying blind here. You may need to do:
>>>
>>> sudo pgxn install temporal_tables
>>
>> I ran pgxn install temporal_tables as root so that should not be the problem.
>>
>
> On an old machine running openSUSE  I did:
>
> sudo pgxn install --pg_config=/usr/local/pgsql15/bin/pg_config temporal_tables
>
>
> Where this
>
>  l /usr/local/pgsql15/lib64/pgxs/src/makefiles/pgxs.mk
> -rw-r--r-- 1 root root 14768 Sep 25 09:23 
> /usr/local/pgsql15/lib64/pgxs/src/makefiles/pgxs.mk
>
> was already installed. I presume as part of the source build I did for 
> Postgres 15.
>
> Maybe the CentOS packaging has a separate package you need to install to get 
> the PGXS infrastructure?
>
>
>
>
It took a little bit of work and required multiple steps since pgxs is part of 
the postgresql13-devel package. It also required upgrading LLVM and installing 
gcc. Note that I am running this particular postgresql13 installation in a 
docker container.

With pgxs installed I could then proceed to install the temporal_tables
extension using pgxn.
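
Roughly the sequence that worked here, for the archives (PGDG packages on
CentOS; package names, paths and the database name are assumptions to adapt):

sudo yum install postgresql13-devel gcc   # provides pgxs.mk; pulled a newer LLVM in as a dependency
sudo pgxn install --pg_config=/usr/pgsql-13/bin/pg_config temporal_tables
psql -U postgres -d mydb -c 'CREATE EXTENSION temporal_tables;'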

Thank you for pointing me in the right direction.





RE: [EXT] Re: Query runtime differences- trying to understand why.

2023-11-29 Thread Dirschel, Steve
Sorry.  The DB is Aurora Postgres 14.6.  Hints are enabled via the extension 
pg_hint_plan.  But my question isn't so much about the hints as about the 
execution plan, where the time is going, and why skipping locked rows appears 
to take a lot more time in one execution of the query than in another.
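
(For context on the hint setup -- a sketch, GUC names per the pg_hint_plan
docs; the debug output is the easiest way to see whether a hint was honoured
or silently ignored, as with the pk index hint mentioned below:)

CREATE EXTENSION pg_hint_plan;       -- how it is enabled on Aurora
SET pg_hint_plan.enable_hint = on;   -- normally already the default
SET pg_hint_plan.debug_print = on;   -- logs used / not used / error hints per query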

Regards

-Original Message-
From: Andreas Kretschmer  
Sent: Wednesday, November 29, 2023 3:10 PM
To: pgsql-general@lists.postgresql.org
Subject: [EXT] Re: Query runtime differences- trying to understand why.

External Email: Use caution with links and attachments.



Am 29.11.23 um 21:25 schrieb Dirschel, Steve:
>
> I have a question on the execution time of a query and the 
> fluctuations I'm seeing.  I enabled auto_trace to capture some actual 
> executions of a query by an app.  Below are 2 executions of it.  The 
> top one took 1.697 milliseconds. The 2nd one took 31.241 milliseconds.  
> Note the query has hints in it
>

what database are you using? PostgreSQL doesn't have hints...

Regards, Andreas

--
Andreas Kretschmer - currently still (garden leave) Technical Account Manager 
(TAM) 
http://www.enterprisedb.com





Re: Query runtime differences- trying to understand why.

2023-11-29 Thread Andreas Kretschmer




Am 29.11.23 um 21:25 schrieb Dirschel, Steve:


I have a question on the execution time of a query and the 
fluctuations I'm seeing.  I enabled auto_trace to capture some actual 
executions of a query by an app.  Below are 2 executions of it.  The 
top one took 1.697 milliseconds. The 2nd one took 31.241 
milliseconds.  Note the query has hints in it




what database are you using? PostgreSQL doesn't have hints...

Regards, Andreas

--
Andreas Kretschmer - currently still (garden leave)
Technical Account Manager (TAM)
www.enterprisedb.com





Query runtime differences- trying to understand why.

2023-11-29 Thread Dirschel, Steve
I have a question on the execution time of a query and the fluctuations I'm 
seeing.  I enabled auto_trace to capture some actual executions of a query by 
an app.  Below are 2 executions of it.  The top one took 1.697 milliseconds.  
The 2nd one took 31.241 milliseconds.  Note the query has hints in it and the 
top one did not obey the hint to use the pk index on table product.  That table 
is very small so full scanning it isn't really a problem.  When I look at the 
steps in the plan below I do not believe the difference in runtime is due to 
plan differences.

Here are my observations for the top query/plan:
-  The query found 318 rows when scanning the entry_guid_mod_idx index on table 
entry.  It did 336 or 337 shared block hits for that step, which took 0.619 
milliseconds.
-  I believe it nested looped to the full scan of table product 318 times.  
That step took 0.812 milliseconds (time including prior step).  At that point 
the query did a total of 841 shared block hits.  I believe it found 318 rows 
that matched.
-  The LockRows step-  this is coming from the SELECT FOR UPDATE SKIP LOCKED.  
I believe this step shows it returned 0 rows.  Runtime was 1.696 milliseconds.  
This implies all 318 rows that matched the query were locked.

Observations for the bottom query/plan:
-  The query found 291 rows when scanning the entry_guid_mod_idx index on table 
entry.  It did 313 or 315 shared block hits for that step and it took 0.977 
milliseconds.
-  I believe it nested looped to the index scan pkproduct 291 times.  That step 
took 1.491 milliseconds (time including prior step).  At that point the query 
did a total of 1365 shared block hits.  I believe it found 291 rows that matched.
-  The LockRows step-  Again, I believe this is coming from the SELECT FOR 
UPDATE SKIP LOCKED.  Like the top query this also shows it returned 0 rows, 
which implies to me that all 291 rows were locked.  But here we can see the 
runtime was 31.239 milliseconds.

So my question: assuming my understanding above is correct, both plans show it 
taking minimal time to get to the LockRows step.  The top query skipped more 
rows than the bottom query, but the bottom query took significantly more time 
at this LockRows step.  Why would that be the case?
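
One way to see who is actually holding things up while this runs -- a hedged
sketch using the table names from the query text below.  (FOR UPDATE row locks
live on the tuples themselves, so this mostly shows the competing sessions and
the age of their transactions rather than per-row locks:)

SELECT l.pid, l.mode, l.granted,
       a.state, a.wait_event_type, a.wait_event,
       now() - a.xact_start AS xact_age,
       left(a.query, 60) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IN ('bchalertdemo.entry'::regclass,
                     'bchalertdemo.product'::regclass);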

Thanks in advance.
Steve

2023-11-29 18:12:45 
UTC:10.210.202.108(50098):alertu@pgbatchshared01aq:[15251]:LOG:  duration: 
1.697 ms  plan:
 Query Text: /*+ NestLoop(t1 t2) IndexScan(t2 pkproduct) */ SELECT  
T1.ENTRY_GUID FROM bchalertdemo.ENTRY T1,bchalertdemo.PRODUCT T2  WHERE 
T1.NEXT_RUN_DATE<$1 AND T1.STATUS_CODE=1 AND T2.ACTIVE_STATUS != 'F'  AND 
(END_DATE IS NULL OR ($2 < END_DATE)) AND (($3 NOT BETWEEN SUSPEND_DATE AND 
RESUME_DATE) OR SUSPEND_DATE IS NULL OR RESUME_DATE IS NULL) AND 
T1.PRODUCT_CODE = T2.PRODUCT_CODE AND T1.LAST_RUN_DATE<$4 AND T2.PRODUCT_VALUE 
IN ($5) AND MOD(ABS(HASHTEXT(T1.ENTRY_GUID)),150) = $6 ORDER BY T1.PRIORITY, 
T1.ENTRY_FREQUENCY_CODE, T1.NEXT_RUN_DATE FOR UPDATE SKIP LOCKED LIMIT 1
 Limit  (cost=0.42..4.63 rows=1 width=64) (actual time=1.695..1.696 rows=0 
loops=1)
   Buffers: shared hit=841
   ->  LockRows  (cost=0.42..1393.48 rows=331 width=64) (actual 
time=1.695..1.695 rows=0 loops=1)
 Buffers: shared hit=841
 ->  Nested Loop  (cost=0.42..1390.17 rows=331 width=64) (actual 
time=0.032..0.812 rows=318 loops=1)
   Join Filter: (t1.product_code = t2.product_code)
   Buffers: shared hit=337
   ->  Index Scan using entry_guid_mod_idx on entry t1  
(cost=0.42..1384.18 rows=331 width=63) (actual time=0.016..0.619 rows=318 
loops=1)
 Index Cond: ((mod(abs(hashtext((entry_guid)::text)), 
150) = 141) AND (next_run_date < '2023-11-29 12:12:38.535'::timestamp without 
time zone) AND (last_run_date < '2023-11-29 12:12:38.535'::timestamp without 
time zone)
AND (status_code = '1'::numeric))
 Filter: (((end_date IS NULL) OR ('2023-11-29 
12:12:38.535'::timestamp without time zone < end_date)) AND (('2023-11-29 
12:12:38.535'::timestamp without time zone < suspend_date) OR ('2023-11-29 
12:12:38.535'::timestamp
without time zone > resume_date) OR 
(suspend_date IS NULL) OR (resume_date IS NULL)))
 Buffers: shared hit=336
   ->  Materialize  (cost=0.00..1.02 rows=1 width=20) (actual 
time=0.000..0.000 rows=1 loops=318)
 Buffers: shared hit=1
 ->  Seq Scan on product t2  (cost=0.00..1.01 rows=1 
width=20) (actual time=0.013..0.013 rows=1 loops=1)
   Filter: ((active_status <> 'F'::bpchar) AND 
((product_value)::text = 'bchalert-poc'::text))
   Buffers: shared hit=1

2023-11-29 18:12:45 
UTC:10.210.202.108(50006):alertu@pgbatchshared01aq:[15226]:LOG:  duration: 
31.241 ms  plan:
 Query Text: /*+ NestLoop(t1 t2) 

Re: Could not read from file "pg_subtrans/00F5" at offset 122880: Success.

2023-11-29 Thread Ron Johnson
On Wed, Nov 29, 2023 at 12:20 PM Sri Mrudula Attili  wrote:

> Hello Laurenz,
>
>
>   Thanks for your response.
>
>
> We are seeing this error on a Delphix virtual database that was
> refreshed using a snapshot of the production standalone database.
>
>
> It puts the database in pg_start_backup mode in order to take the snapshot.
>

Isn't pg_basebackup the canonical method for taking a snapshot?
(PgBackRest, too, if you want more speed.)
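
(For comparison, a minimal pg_basebackup invocation of that kind -- host, user
and target directory below are placeholders:)

pg_basebackup -h prod-host -U replicator -D /var/lib/pgsql/12/data_copy \
    -X stream --checkpoint=fast --progress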


Re: Could not read from file "pg_subtrans/00F5" at offset 122880: Success.

2023-11-29 Thread Alvaro Herrera
On 2023-Nov-28, Kyotaro Horiguchi wrote:

> By the way, just out of curiosity, but errno should not be zero at the
> time the message above was output, yet "%m" is showing "success",
> which implies errno = 0 in Linux. How can that happen?

If the file is exactly of the length given then seek will work, and read
will not set errno but return a length of 0 bytes.  So
SlruPhysicalReadPage sets SLRU_READ_FAILED and then in pg11 the %m in
SlruReportIOError expands to "Success".

The code in master is completely different (it uses pg_pread rather than
seek + read): it does test for errno and reports accordingly.

So, nothing to do here.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Could not read from file "pg_subtrans/00F5" at offset 122880: Success.

2023-11-29 Thread Sri Mrudula Attili

Hello Laurenz,


 Thanks for your response.


We are seeing this error on a Delphix virtual database that was 
refreshed using a snapshot of the production standalone database.



It puts the database in pg_start_backup mode in order to take the snapshot.


I did check the dsource (production) database logs and it hasn't got any 
errors like this.



Thanks,

Sri Attili

On 27/11/2023 18:58, Laurenz Albe wrote:

On Mon, 2023-11-27 at 11:50 +, Sri Mrudula Attili wrote:

ERROR:  could not access status of transaction 16087052
DETAIL:  Could not read from file "pg_subtrans/00F5" at offset 122880: Success.
STATEMENT:  SELECT distinct

That's data corruption.
Time to restore your backup.

Investigate how you got there.  Did you run "pg_resetwal"?
Did you restore a database from a file system backup?
Did you remove a "backup_label" file?  Is your hardware broken?

Yours,
Laurenz Albe





Re: Emitting JSON to file using COPY TO

2023-11-29 Thread Davin Shearer
Thanks for the responses everyone.

I worked around the issue using the `psql -tc` method as Filip described.
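
That invocation looks roughly like this (database name and output file are
placeholders; the query is the one from Filip's example quoted below):

psql -d mydb -t -A -o out.json \
     -c 'select json_agg(row_to_json(t)) from (select * from public.tbl_json_test) t;'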

I think it would be great to support writing JSON using COPY TO at
some point so I can emit JSON to files using a PostgreSQL function directly.

-Davin

On Tue, Nov 28, 2023 at 2:36 AM Filip Sedlák  wrote:

> This would be a very special case for COPY. It applies only to a single
> column of JSON values. The original problem can be solved with psql
> --tuples-only as David wrote earlier.
>
>
> $ psql -tc 'select json_agg(row_to_json(t))
>   from (select * from public.tbl_json_test) t;'
>
>   [{"id":1,"t_test":"here's a \"string\""}]
>
>
> Special-casing any encoding/escaping scheme leads to bugs and harder
> parsing.
>
> Just my 2c.
>
> --
> Filip Sedlák
>


Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

2023-11-29 Thread Chris Travers
On Wed, Nov 29, 2023 at 4:36 PM Abdul Qoyyuum 
wrote:

> Hi all,
>
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
>
> SET zero_damaged_pages = 0; # This is so that the application can continue
> to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.
>

So first, to clear up some confusion on my part here:

This procedure doesn't make a lot of sense to me.  But did it clear up the
issue?

In any of these cases, it is extremely important to diagnose the system
properly.  If you have a fault in your storage device or RAID controller,
for example, you are asking for more corruption and data loss later.

At first I thought maybe you mistyped something and then realized there
were a few issues with the process so it actually didn't make sense.

First, zero_damaged_pages defaults to 0, and I can think of no reason to
set it explicitly.
Secondly, a VACUUM FULL rebuilds the indexes anyway, so there is no reason
to reindex afterwards.  Your whole procedure amounts to a VACUUM FULL, and
the index rebuild is the only part that could have affected this; if it did
work, reindexing is the part that helped.

On to the question of what to do next

>
> We're on PostgreSQL 12. This worked when the problem happened before (almost
> exactly a year ago), but I think this needs a more permanent solution. I've
> looked at routine vacuuming and checked that autovacuum is set to on, with
> the following configuration:
>
> core=> select name, setting from pg_settings where name like 'autovacuum%';
>                 name                 |  setting
> -------------------------------------+----------
>  autovacuum                          | on
>  autovacuum_analyze_scale_factor     | 0.1
>  autovacuum_analyze_threshold        | 50
>  autovacuum_freeze_max_age           | 2
>  autovacuum_max_workers              | 3
>  autovacuum_multixact_freeze_max_age | 4
>  autovacuum_naptime                  | 60
>  autovacuum_vacuum_cost_delay        | 2
>  autovacuum_vacuum_cost_limit        | -1
>  autovacuum_vacuum_scale_factor      | 0.2
>  autovacuum_vacuum_threshold         | 50
>  autovacuum_work_mem                 | -1
> (12 rows)
>
> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block. We are running a High Availability cluster
> setup, but we are hoping that there may be a way to mitigate it.
>
>
You need to figure out why the corruption is happening.  This is most
likely, in my experience, not a PostgreSQL bug, but usually something that
happens on the hardware layer or an environmental factor.  It could be
failing storage or CPU.  Or it could be something like bad electrical input
or insufficient cooling (I have seen index and even table corruption issues
from both of these).

If this is a server you run, the first things I would check are:
1.  Is there a good-quality UPS that the server is plugged into?  Are the
batteries in good working order?
2.  Is the server somewhere that may be sitting in a pocket of hot air?

Once you have ruled these out, the next things to check are CPU, memory,
and storage health.  Unfortunately checking these is harder but you can
check SMART indications, and other diagnostic indicators.
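
For example, something along these lines -- a rough sketch, where the device
name and data directory are placeholders, and pg_checksums only helps if
checksums were enabled at initdb time and the server is cleanly shut down:

smartctl -a /dev/sda                              # SMART health and error counters per disk
pg_checksums --check -D /var/lib/pgsql/12/data    # scans every data page (PostgreSQL 12+)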

However, once these errors start happening, you are in danger territory and
need to find out why (and correct the underlying problem) before you get
data loss.

-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

2023-11-29 Thread Stephen Frost
Greetings,

* Abdul Qoyyuum (aqoyy...@cardaccess.com.bn) wrote:
> Knowing that it's a data corruption issue, the only way to fix this is to
> vacuum and reindex the database. What was suggested was the following:
> 
> SET zero_damaged_pages = 0; # This is so that the application can continue
> to run
> VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
> possible.
> REINDEX DATABASE "core"; # Then do a reindex and clean it up.

This is only going to help if the issue is in an index, which isn't
clear from what's been shared.

> We're on PostgreSQL 12. This worked when the problem happened before (almost
> exactly a year ago), but I think this needs a more permanent solution. I've
> looked at routine vacuuming and checked that autovacuum is set to on, with
> the following configuration:

This isn't something that should ever happen ...

This also doesn't have anything to do with autovacuum; changing settings
there won't make any difference.

> Can anyone advise if there's anything else we can do? We have no clue what
> causes the invalid page block. We are running a High Availability cluster
> setup, but we are hoping that there may be a way to mitigate it.

Was there some kind of hardware fault?  Did you do a failover?  Restore
from a backup?  Do you have checksums enabled?  How many times has this
happened before, and how many pages were impacted?  What is the design of
your HA solution -- are you using PG replication or something else?
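
(On the checksums question, it is quick to check -- the data directory path
below is a placeholder:)

-- inside any session:
SHOW data_checksums;

# or from the shell:
pg_controldata /var/lib/pgsql/12/data | grep -i checksum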

Thanks,

Stephen




Fixing or Mitigating this ERROR: invalid page in block 35217 of relation base/16421/3192429

2023-11-29 Thread Abdul Qoyyuum
Hi all,

Knowing that it's a data corruption issue, the only way to fix this is to
vacuum and reindex the database. What was suggested was the following:

SET zero_damaged_pages = 0; # This is so that the application can continue
to run
VACUUM FULL VERBOSE ANALYSE; # Do a full vacuum and analyse the problem if
possible.
REINDEX DATABASE "core"; # Then do a reindex and clean it up.

We're on PostgreSQL 12. This worked when the problem happened before (almost
exactly a year ago), but I think this needs a more permanent solution. I've
looked at routine vacuuming and checked that autovacuum is set to on, with the
following configuration:

core=> select name, setting from pg_settings where name like 'autovacuum%';
                name                 |  setting
-------------------------------------+----------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 2
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 4
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 2
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
(12 rows)

Can anyone advise if there's anything else we can do? We have no clue what
causes the invalid page block. We are running a High Availability cluster
setup, but we are hoping that there may be a way to mitigate it.