Re: dsa_allocate() faliure

2019-01-29 Thread Fabio Isabettini
Hello,
 we are facing a similar issue on a Production system using a Postgresql 10.6:

org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: 
EXCEPTION on getstatistics_media ; ID: uidatareader.
run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free pages


The query reads remotely (via pl/proxy) tables containing a lot of data (up to 
millions or rows for each table/node) after a remote “group by" returns to the 
caller “master” node only a few hundreds of rows from each “slave” node.
The tables are partitioned using the INHERITANCE method that we are using since 
years with no issue. All tables have the same columns structure and number, 
about 300 columns. In the query there are no join, only a variable set of 
partitions depending on the date range.
The “REMOTE FATAL” refers to the pl/proxy that runs on 2 different slaves, [a0] 
and [a1], nodes with identical configuration and database structure, but it 
seems to fail only on node [a1].
When we get the error if we reduce the date range and therefore the quantity of 
data read, the error disappears, the same if we set 
max_parallel_workers_per_gather = 0.
Obviously we cannot force the user to use short periods of time to avoid the 
error and so we have disabled the parallel query feature for the time being.
It is difficult to reproduce the issue because not always the user gets the 
error, furthermore re-running the same query in different moments/days it 
usually works. It is a kind of weird.
We would like not to stop the Production system and upgrade it to PG11. And 
even though would this guarantee a permanent fix? 
Any suggestion? 


Regards,
Fabio Isabettini
Voipfuture (Germany)



The failing node [a1] configuration:

OS: Centos 7 kernerl 3.10.0-862.11.6.el7.x86_64
Postgres: postgres-10.5-862.11.6.1
RAM: 256 GB (The main server containing the master node and [a0] node, the 
slave that has no issue, has 384 GB of RAM)
CPU cores: 32

shared_buffers = 64GB
max_worker_processes = 32
max_parallel_workers_per_gather = 8
max_parallel_workers = 32


> On 28. Jan 2019, at 19:56:01, Thomas Munro  
> wrote:
> 
> On Tue, Jan 29, 2019 at 2:50 AM Arne Roland  wrote:
>> does anybody have any idea what goes wrong here? Is there some additional 
>> information that could be helpful?
> 
> Hi Arne,
> 
> This seems to be a bug; that error should not be reached.  I wonder if
> it is a different manifestation of the bug reported as #15585 (ie some
> type of rare corruption).  Are you able to reproduce this
> consistently?  Can you please show the query plan?
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com
> 



Re: dsa_allocate() faliure

2019-01-30 Thread Fabio Isabettini
Hi Thomas,
it is a Production system and we don’t have permanent access to it.
Also to have an auto_explain feature always on, is not an option in production.
I will ask the customer to give us notice asap the error present itself to 
connect immediately and try to get a query plan.

Regards

Fabio Isabettini
www.voipfuture.com 

> On 30. Jan 2019, at 04:13:14, Thomas Munro  
> wrote:
> 
> On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
>  wrote:
>> we are facing a similar issue on a Production system using a Postgresql 10.6:
>> 
>> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: 
>> EXCEPTION on getstatistics_media ; ID: uidatareader.
>> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free 
>> pages
> 
>> We would like not to stop the Production system and upgrade it to PG11. And 
>> even though would this guarantee a permanent fix?
>> Any suggestion?
> 
> Hi Fabio,
> 
> Thanks for your report.  Could you please also show the query plan
> that runs on the "remote" node (where the error occurred)?
> 
> There is no indication that upgrading to PG11 would help here.  It
> seems we have an undiagnosed bug (in 10 and 11), and so far no one has
> been able to reproduce it at will.  I personally have chewed a lot of
> CPU time on several machines trying various plan shapes and not seen
> this or the possibly related symptom from bug #15585 even once.  But
> we have about three reports of each of the two symptoms.  One reporter
> wrote to me off-list to say that they'd seen #15585 twice, the second
> time by running the same query in a tight loop for 8 hours, and then
> not seen it again in the past 3 weeks.  Clearly there is issue needing
> a fix here, but I don't yet know what it is.
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com
>