Re: dsa_allocate() failure

2019-02-04 Thread Justin Pryzby
On Mon, Feb 04, 2019 at 08:31:47PM +, Arne Roland wrote:
> I could take a backup and restore the relevant tables on a throwaway system. 
> You are just suggesting to replace line 728
> elog(FATAL,
>      "dsa_allocate could not find %zu free pages", npages);
> by
> elog(PANIC,
>      "dsa_allocate could not find %zu free pages", npages);
> correct? Just for my understanding: why would the shutdown of the whole 
> instance create more helpful logging?

You'd also start with pg_ctl -c, which would allow it to dump core, which could
be inspected with GDB to show a backtrace and other internals, which up to now
nobody (including myself) has been able to provide.
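A session along those lines might look like the sketch below; the data directory, binary path, and core-file name are placeholder examples, not paths from this thread.

```shell
# Enable core dumps in the shell that will start the throwaway cluster,
# then start it with pg_ctl's -c flag so backends may dump core on abort.
ulimit -c unlimited 2>/dev/null
ulimit -c                          # shows the current core-file size limit
# pg_ctl start -D /path/to/throwaway-data -c      # paths are examples
# After the crash, pull a backtrace out of the core file with gdb:
# gdb -batch -ex bt /usr/pgsql-9.6/bin/postgres /path/to/throwaway-data/core.12345
```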

Justin



RE: dsa_allocate() failure

2019-02-04 Thread Arne Roland
It's definitely quite a complex pattern. The query I sent you last time was 
minimal with respect to predicates (removing any single one of the predicates 
turned it into a working query).
> Huh.  Ok well that's a lot more frequent than I thought.  Is it always the 
> same query?  Any chance you can get the plan?  Are there more things going on 
> on the server, like perhaps concurrent parallel queries?
I had this bug occur while I was the only one working on the server. I 
checked that there was just one transaction with a snapshot at all, and it was 
an autovacuum busy with a totally unrelated relation my colleague was working on.

The bug is indeed behaving like a ghost.
One child relation needed a few new rows to test a particular application a 
colleague of mine was working on. The insert triggered an autoanalyze and the 
explain changed slightly:
Besides row and cost estimates, the change is that the line
Recheck Cond: (((COALESCE((fid)::bigint, fallback)) >= 1) AND ((COALESCE((fid)::bigint, fallback)) <= 1) AND (gid && '{853078,853080,853082}'::integer[]))
is now
Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND ((COALESCE((fid)::bigint, fallback)) >= 1) AND ((COALESCE((fid)::bigint, fallback)) <= 1))
and the error vanished.

I could try to hunt down another query by assembling seemingly random queries. 
I don't see a very clear pattern in the queries aborting with this error on 
our production servers. I'm not surprised this bug is hard to chase on 
production servers; they are usually quite lively.

>If you're able to run a throwaway copy of your production database on another 
>system that you don't have to worry about crashing, you could just replace 
>ERROR with PANIC and run a high-speed loop of the query that crashed in 
>production, or something.  This might at least tell us whether it's reached 
>that condition via something dereferencing a dsa_pointer or something 
>manipulating the segment lists while allocating/freeing.

I could take a backup and restore the relevant tables on a throwaway system. 
You are just suggesting to replace line 728
elog(FATAL,
     "dsa_allocate could not find %zu free pages", npages);
by
elog(PANIC,
     "dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would the shutdown of the whole 
instance create more helpful logging?

All the best
Arne


Re: ERROR: found xmin from before relfrozenxid

2019-02-04 Thread Mariel Cherkassky
Dumping the table and then restoring it solved the case for me. SELECT FOR
UPDATE didn't help.

Thanks!

On Wed, 30 Jan 2019 at 12:35, Mariel Cherkassky <mariel.cherkas...@gmail.com> wrote:

> Hey,
> As I said, I'm getting this error for all the objects in a specific db. I
> cant even connect to the database, I immediatly getting this error.
> The bug was fixed in 9.6.10 but the db version is 9.6.10 so how can it
> happen ? The db was installed in that version from the first place and *no
> upgrade was done*
>
> On Wed, 30 Jan 2019 at 11:14, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>
>> On 2019-Jan-30, Mariel Cherkassky wrote:
>>
>> > It seems that the version of the db is 9.6.10 :
>> >
>> > psql -U db -d db -c "select version()";
>> > Password for user db:
>> > version
>> >
>> ---
>> > PostgreSQL 9.6.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
>> > 20120313 (Red Hat 4.4.7-23), 64-bit
>> > (1 row)
>> >
>> >
>> > and the error still exists.
>>
>> Did you apply the suggested SELECT .. FOR UPDATE to the problem table?
>>
>> --
>> Álvaro Herrera                https://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>


Sv: Re: Fw: server hardware tuning.

2019-02-04 Thread Andreas Joseph Krogh
On Monday 4 February 2019 at 15:45:30, Prince Pathria <prince.path...@goevive.com> wrote:
> Hi Suganthi, I can give you a start, some pro users can suggest you better.
> 1. Don't use this many connections on a single Postgres server. Use a connection pooler in front of it.
> 2. RAM: Depends upon how much data you want to be cached.
> 3. Use PCIe SATA SSD with RAID10, Postgres uses a lot of IO for its operations.
> 4. For config tuning: https://pgtune.leopard.in.ua/#/ Though please go through all params for more understanding
> Happy to help :)
> Prince Pathria, Systems Architect Intern, Evive, +91 9478670472, goevive.com

There's no such thing as PCIe SATA; use PCIe or NVMe in RAID-10, it's quite
affordable these days and meaningless not to use.

--
Andreas Joseph Krogh

Re: Fw: server hardware tuning.

2019-02-04 Thread Prince Pathria
Hi Suganthi,
I can give you a start, some pro users can suggest you better.

1. Don't use this many connections on a single Postgres server. Use a
connection pooler in front of it.
2. RAM: Depends upon how much data you want to be cached.
3. Use PCIe SATA SSD with RAID10, Postgres uses a lot of IO for its
operations.
4. For config tuning: https://pgtune.leopard.in.ua/#/ Though please go
through all params for more understanding
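A connection pooler as suggested in item 1 could be a minimal pgbouncer setup; in the sketch below the database name, ports, and pool sizes are illustrative assumptions, not recommendations.

```ini
[databases]
; clients connect to pgbouncer on 6432; it fans into the real server on 5432
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; reuse server connections between transactions
max_client_conn = 1100    ; the 1100 application connections from the question
default_pool_size = 20    ; actual backend connections per database/user pair
```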

Happy to help :)
Prince Pathria Systems Architect Intern Evive +91 9478670472 goevive.com


On Mon, Feb 4, 2019 at 6:07 PM suganthi Sekar  wrote:

>
> Hi ,
>
>
> I need to know how to calculate hardware sizing for a database or query
>
>
> RAM
>
> CPU
>
> Config tuning
>
>
> Requirement :
>
>
> 1100 concurrent connections
>
> a table with 1600 columns
>
> 1 GB of data may be selected, and dynamic aggregation will happen
>
>
> Regards
>
> SuganthiSekar
>


Fw: server hardware tuning.

2019-02-04 Thread suganthi Sekar

Hi ,


I need to know how to calculate hardware sizing for a database or query


RAM

CPU

Config tuning


Requirement :


1100 concurrent connections

a table with 1600 columns

1 GB of data may be selected, and dynamic aggregation will happen


Regards

SuganthiSekar
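Sizing for requirements like these usually starts from rules of thumb; the back-of-envelope sketch below assumes a 64 GB server, and both that figure and the percentages are illustrative assumptions, not official recommendations.

```shell
# Illustrative sizing arithmetic for an assumed 64 GB server.
ram_mb=65536
echo "shared_buffers = $(( ram_mb / 4 ))MB"            # ~25% of RAM, a common starting point
echo "effective_cache_size = $(( ram_mb * 3 / 4 ))MB"  # ~75% of RAM; a planner hint, not an allocation
# 1100 direct connections leave very little memory per backend, which is
# one reason to put a connection pooler in front instead:
echo "per-connection share of 25% RAM = $(( ram_mb / 4 / 1100 ))MB"
```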


Re: dsa_allocate() failure

2019-02-04 Thread Thomas Munro
On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa  wrote:
> I see the error showing up every night on 2 different servers. But it's a bit 
> of a heisenbug because if I go there now it won't be reproducible.

Huh.  Ok well that's a lot more frequent than I thought.  Is it always
the same query?  Any chance you can get the plan?  Are there more
things going on on the server, like perhaps concurrent parallel
queries?

> It was suggested by Justin Pryzby that I recompile pg src with his patch that 
> would cause a coredump.

Small correction to Justin's suggestion: don't abort() after
elog(ERROR, ...), it'll never be reached.

> But I don't feel comfortable doing this especially if I would have to run 
> this with prod data.
> My question is. Can I do anything like increasing logging level or enable 
> some additional options?
> It's a production server but I'm willing to sacrifice a bit of it's 
> performance if that would help.

If you're able to run a throwaway copy of your production database on
another system that you don't have to worry about crashing, you could
just replace ERROR with PANIC and run a high-speed loop of the query
that crashed in production, or something.  This might at least tell us
whether it's reached that condition via something dereferencing a
dsa_pointer or something manipulating the segment lists while
allocating/freeing.
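That high-speed loop could be as small as the sketch below; the database name and query file are placeholders, not names from this thread.

```shell
# Hammer the throwaway database with the suspect query until the backend
# PANICs; at that point psql exits non-zero and the loop stops.
while psql -d throwaway_copy -f suspect_query.sql >/dev/null 2>&1; do
    :                      # keep looping while the query still succeeds
done
echo "psql failed: check the server log and look for a core file"
```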

In my own 100% unsuccessful attempts to reproduce this I was mostly
running the same query (based on my guess at what ingredients are
needed), but perhaps it requires a particular allocation pattern that
will require more randomness to reach... hmm.

-- 
Thomas Munro
http://www.enterprisedb.com