Re: [HACKERS] Kernel Tainted

2016-10-05 Thread reiner peterke

> On Oct 5, 2016, at 9:43 PM, Tomas Vondra  wrote:
> 
> On 10/05/2016 08:41 PM, reiner peterke wrote:
>> Hi,
>> 
>> We are helping a client test an application on Power8 using Postgres
>> 9.5.4, which has been compiled specifically for Power.
>> 
>> This is running on SLES 12 SP1; the current kernel is 3.12.49-11.
>> 
>> We are getting these kernel warnings associated with the postmaster
>> process.  The application is handling around 15000 TPS.  It appears that
>> one of these messages is generated for each transaction, which fills
>> up the warn.log quite quickly.
>> 
>> I'm trying to understand what is causing the Tainted kernel messages.
>> The warning is at 'WARNING: at ../net/core/dst.c:287'.
>> I've found one link that indicates that this is IPv6 related.
>> https://brunomgalmeida.wordpress.com/2015/07/23/disable-ipv6-postgres-and-pgbouncer/
>> Is this accurate?  And if these actions resolve the error, is it more of
>> a band-aid than an actual fix?
>> 
> 
> As Andres already pointed out, this is most likely a kernel issue, not a 
> PostgreSQL one. The "tainted" flag has nothing to do with the cause, it's just 
> a way to inform users whether it's a clean kernel build, or if it includes code 
> not available in vanilla kernels etc. The "X" means there are some 
> SUSE-specific modules loaded, IIRC.
> 
> And yes, it seems IPv6 related, at least judging by the stack trace:
> 
> 0xc16f7d80 (unreliable)
> sk_dst_check+0x174/0x180
> ip6_sk_dst_lookup_flow+0x4c/0x2a0
> udpv6_sendmsg+0x688/0xb20
> inet_sendmsg+0x9c/0x120
> sock_sendmsg+0xec/0x140
> SyS_sendto+0x108/0x150
> SyS_send+0x50/0x70
> SyS_socketcall+0x2a0/0x440
> syscall_exit+0x0/0x7c
> 
> You should probably talk to SUSE or whoever supports that system.
> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for the clear information.
I think there are a few kernel upgrades we can apply first, then see if that 
fixes the problem.
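
For reference, the taint state can also be read as a bitmask from 
/proc/sys/kernel/tainted. Below is a minimal sketch of decoding a few of the 
upstream bits. The function name is hypothetical, the output is a compact 
string rather than the kernel's padded format, and the SUSE-specific X flag is 
deliberately left out because I have not checked its bit number for this kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: writes a compact flag string for a taint bitmask.
 * Only three upstream bits are decoded here:
 *   bit 0  -> 'P' (proprietary module loaded), otherwise 'G'
 *   bit 9  -> 'W' (the kernel has issued a warning)
 *   bit 12 -> 'O' (an out-of-tree module was loaded)
 * Any other bit, including vendor-specific ones, is ignored.
 */
void
describe_taint(unsigned long mask, char *out, size_t outlen)
{
    size_t n = 0;

    assert(out != NULL && outlen >= 4);

    out[n++] = (mask & (1UL << 0)) ? 'P' : 'G';
    if (mask & (1UL << 9))
        out[n++] = 'W';
    if (mask & (1UL << 12))
        out[n++] = 'O';
    out[n] = '\0';
}
```

Calling describe_taint() with the value read from /proc/sys/kernel/tainted and 
a buffer of at least 4 bytes gives a quick view of whether a warning has 
already been recorded, which is separate from whatever triggered it.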

Reiner



[HACKERS] Kernel Tainted

2016-10-05 Thread reiner peterke
Hi,

We are helping a client test an application on Power8 using Postgres 9.5.4, 
which has been compiled specifically for Power.

This is running on SLES 12 SP1; the current kernel is 3.12.49-11.

We are getting these kernel warnings associated with the postmaster process.  
The application is handling around 15000 TPS.  It appears that one of these 
messages is generated for each transaction, which fills up the warn.log 
quite quickly.

I'm trying to understand what is causing the Tainted kernel messages. The 
warning is at 'WARNING: at ../net/core/dst.c:287'.
I've found one link that indicates that this is IPv6 related.
https://brunomgalmeida.wordpress.com/2015/07/23/disable-ipv6-postgres-and-pgbouncer/

Is this accurate?  And if these actions resolve the error, is it more of a 
band-aid than an actual fix?

Any comments are appreciated.

A sample of the error is below.

Reiner

2016-10-05T15:08:50.219292+02:00 PPDLMREB04 kernel: [ cut here 
]
2016-10-05T15:08:50.219335+02:00 PPDLMREB04 kernel: WARNING: at 
../net/core/dst.c:287
2016-10-05T15:08:50.219341+02:00 PPDLMREB04 kernel: Modules linked in: 
af_packet xfs libcrc32c ibmveth(X) rtc_generic btrfs xor raid6_pq 
dm_service_time sr_mod sd_mod cdrom crc_t10dif ibmvfc(X) scsi_transport_fc 
ibmvscsi(X) scsi_transport_srp scsi_tgt dm_multipath scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua scsi_dh dm_mod sg scsi_mod autofs4
2016-10-05T15:08:50.219346+02:00 PPDLMREB04 kernel: Supported: Yes, External
2016-10-05T15:08:50.219350+02:00 PPDLMREB04 kernel: CPU: 28 PID: 113041 Comm: 
postmaster Tainted: G   X 3.12.49-11-default #1
2016-10-05T15:08:50.219355+02:00 PPDLMREB04 kernel: task: c003c31100d0 ti: 
c003c1f08000 task.ti: c003c1f08000
2016-10-05T15:08:50.219362+02:00 PPDLMREB04 kernel: NIP: c05c0bb0 LR: 
c0594bc4 CTR: c06a0ae0
2016-10-05T15:08:50.219415+02:00 PPDLMREB04 kernel: REGS: c003c1f0b630 
TRAP: 0700   Tainted: G   X  (3.12.49-11-default)
2016-10-05T15:08:50.219424+02:00 PPDLMREB04 kernel: MSR: 80029033 
  CR: 24022288  XER: 0016
2016-10-05T15:08:50.219430+02:00 PPDLMREB04 kernel: CFAR: c0594bc0 
SOFTE: 1 
2016-10-05T15:08:50.219438+02:00 PPDLMREB04 kernel: GPR00: c0594bc4 
c003c1f0b8b0 c0e8ff00 c003c35a1980 
2016-10-05T15:08:50.219444+02:00 PPDLMREB04 kernel: GPR04: 0002 
 0001  
2016-10-05T15:08:50.219448+02:00 PPDLMREB04 kernel: GPR08:  
0001  c0710810 
2016-10-05T15:08:50.219452+02:00 PPDLMREB04 kernel: GPR12: c06a0ae0 
c7b2fc00 7fff 003c 
2016-10-05T15:08:50.219457+02:00 PPDLMREB04 kernel: GPR16:  
10735620   
2016-10-05T15:08:50.219462+02:00 PPDLMREB04 kernel: GPR20: c003c1746d80 
0001  03a8 
2016-10-05T15:08:50.219466+02:00 PPDLMREB04 kernel: GPR24: c003c1f0b9f0 
  03a8 
2016-10-05T15:08:50.219470+02:00 PPDLMREB04 kernel: GPR28: 0002 
c003c1746a00  c003c35a1980 
2016-10-05T15:08:50.219475+02:00 PPDLMREB04 kernel: NIP [c05c0bb0] 
dst_release+0x50/0xa0
2016-10-05T15:08:50.219479+02:00 PPDLMREB04 kernel: LR [c0594bc4] 
sk_dst_check+0x174/0x180
2016-10-05T15:08:50.219484+02:00 PPDLMREB04 kernel: Call Trace:
2016-10-05T15:08:50.219489+02:00 PPDLMREB04 kernel: [c003c1f0b8b0] 
[c16f7d80] 0xc16f7d80 (unreliable)
2016-10-05T15:08:50.219495+02:00 PPDLMREB04 kernel: [c003c1f0b8e0] 
[c0594bc4] sk_dst_check+0x174/0x180
2016-10-05T15:08:50.219501+02:00 PPDLMREB04 kernel: [c003c1f0b920] 
[c068d4cc] ip6_sk_dst_lookup_flow+0x4c/0x2a0
2016-10-05T15:08:50.219506+02:00 PPDLMREB04 kernel: [c003c1f0b970] 
[c06b16b8] udpv6_sendmsg+0x688/0xb20
2016-10-05T15:08:50.219511+02:00 PPDLMREB04 kernel: [c003c1f0baf0] 
[c064b30c] inet_sendmsg+0x9c/0x120
2016-10-05T15:08:50.219515+02:00 PPDLMREB04 kernel: [c003c1f0bb40] 
[c058eabc] sock_sendmsg+0xec/0x140
2016-10-05T15:08:50.219519+02:00 PPDLMREB04 kernel: [c003c1f0bc60] 
[c05921a8] SyS_sendto+0x108/0x150
2016-10-05T15:08:50.219524+02:00 PPDLMREB04 kernel: [c003c1f0bd80] 
[c0592240] SyS_send+0x50/0x70
2016-10-05T15:08:50.219530+02:00 PPDLMREB04 kernel: [c003c1f0bdc0] 
[c05933f0] SyS_socketcall+0x2a0/0x440
2016-10-05T15:08:50.219534+02:00 PPDLMREB04 kernel: [c003c1f0be30] 
[c000a17c] syscall_exit+0x0/0x7c
2016-10-05T15:08:50.219541+02:00 PPDLMREB04 kernel: Instruction dump:
2016-10-05T15:08:50.219547+02:00 PPDLMREB04 kernel: 6000 2fbf 419e0038 
395f0080 7c2004ac 7d205028 3129 7d20512d 
2016-10-05T15:08:50.219553+02:00 PPDLMREB04 kernel: 4

Re: [HACKERS] Problems with huge_pages and IBM Power8

2016-04-13 Thread reiner peterke

> On Apr 12, 2016, at 10:26 PM, Tom Lane  wrote:
> 
> Andres Freund  writes:
>> On 2016-04-12 21:58:14 +0200, reiner peterke wrote:
>>> Looking for some insight into this issue.  The error from the postgres
>>> log on Ubuntu is below.  It appears to be related to semaphores.
> 
>> I've a bit of a hard time believing that this is related to huge pages.
> 
> I'm betting that's this:
> 
> http://www.postgresql.org/message-id/cak7teys9-o4bterbs3xuk2bffnnd55u2sm9j5r2fi7v6bhj...@mail.gmail.com
> 
>   regards, tom lane

Hi Tom,

You appear to have been correct.  :-)
We were led to believe the crash was connected to huge_pages, but that turns 
out to have been a coincidence, since we had been working with huge_pages at 
the time.

Postgres happened to crash shortly after we enabled huge_pages and did not 
appear to crash when they were disabled, which led us to that conclusion.

After a bit more careful testing we found out that postgres did indeed crash 
even without huge_pages.  The setting in the link appears to have resolved this 
issue.

Thanks.

reiner





Re: [HACKERS] Problems with huge_pages and IBM Power8

2016-04-12 Thread reiner peterke

> On Apr 12, 2016, at 10:20 PM, Andres Freund  wrote:
> 
> On 2016-04-12 21:58:14 +0200, reiner peterke wrote:
>> Hi
>> 
>> We have been doing some testing with Postgres (9.5.2) compiled on a Power8 
>> running CentOS 7.
>> 
>> When working with huge_pages, we initially got this error.
>> 
>> munmap(0x3efbe400) failed: Invalid argument
> 
> *munmap*, not mmap failed? that's odd; because there the hugepagesize
> shouldn't have much of an influence. If something fails it should be the
> initial mmap.  
I'll double check in the morning, but I did copy it from the log.

> Could you show a strace of a failed start with an
> unmodified postgres?

We didn't have the error when not using huge_pages.

>> After a bit of investigation we noticed that hugepagesize is hard-coded
>> to 2MB
> 
> Note it's not actually hardcoded to some size. It's just about rounding
> the size to a multiple of 2MB due to an older kernel bug:
>   /*
>    * Round up the request size to a suitable large value.
>    *
>    * Some Linux kernel versions are known to have a bug, which causes
>    * mmap() with MAP_HUGETLB to fail if the request size is not a
>    * multiple of any supported huge page size. To work around that, we
>    * round up the request size to nearest 2MB. 2MB is the most common
>    * huge page page size on affected systems.
> 
> 
>> Going further, we tried testing hugepages also on Ubuntu 16.04, also on the 
>> power8.  On Ubuntu Postgres did not like the hugepages at all (set also to 
>> 16MB)  and consistently crashed.
> 
>> Looking for some insight into this issue.  The error from the postgres
>> log on Ubuntu is below.  It appears to be related to semaphores.
> 
> I've a bit of a hard time believing that this is related to huge pages.

Well, all I have at the moment is that when we disabled huge pages at the 
kernel level and then restarted postgres, there were no additional crashes.
Unfortunately I cannot access the server now.  I will look further tomorrow.


> 
> 
> Greetings,
> 
> Andres Freund

Sincerely,

Reiner Peterke





[HACKERS] Problems with huge_pages and IBM Power8

2016-04-12 Thread reiner peterke
Hi

We have been doing some testing with Postgres (9.5.2) compiled on a Power8 
running CentOS 7.

When working with huge_pages, we initially got this error.

munmap(0x3efbe400) failed: Invalid argument

After a bit of investigation we noticed that hugepagesize is hard-coded to 2MB:

src/backend/port/sysv_shmem.c (line 360)
 
...
int hugepagesize = 2 * 1024 * 1024;

But on the Power8 they were configured to 16MB.  After recompiling with 16MB 
(16 * 1024 * 1024) we had no problems with the tests.
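
As a rough illustration of why the constant matters, here is a sketch with a 
hypothetical helper name (round_up_to_page); it is not the code in 
sysv_shmem.c. A size rounded up to a multiple of 2MB is not necessarily a 
multiple of 16MB, and as far as I understand munmap() on a huge page mapping 
expects the length to be a multiple of the underlying huge page size, which 
would be consistent with the "Invalid argument" above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: round size up to the next
 * multiple of page_size. Works for any page_size > 0, not just powers of 2.
 */
size_t
round_up_to_page(size_t size, size_t page_size)
{
    size_t rem;

    assert(page_size > 0);

    rem = size % page_size;
    return rem == 0 ? size : size + (page_size - rem);
}
```

For example, an 18MB request rounded with a 2MB page size stays at 18MB, which 
is not a multiple of 16MB; rounded with a 16MB page size it becomes 32MB.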

My initial questions are:

1 why is the hugepagesize hard-coded to 2MB?
2 are there any side effects in setting it to 16MB?
3 since on Power hugepages can have different values, would it be possible 
to have this value configurable?
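
For question 3, one possible approach (an assumption on my part, not something 
Postgres 9.5 does) would be to read the default huge page size from the 
Hugepagesize line of /proc/meminfo at startup. A minimal sketch, with 
hypothetical function names:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: if line is a "Hugepagesize:" entry in the
 * /proc/meminfo format, return its value in kB; otherwise return 0.
 */
unsigned long
parse_hugepagesize_kb(const char *line)
{
    unsigned long kb = 0;

    assert(line != NULL);

    /* sscanf returns 1 only if the literal prefix and the number matched */
    if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
        return kb;
    return 0;
}

/*
 * Hypothetical helper: scan a meminfo-style file and return the huge page
 * size in bytes, or 0 if the file cannot be opened or has no such line.
 */
unsigned long
read_hugepagesize_bytes(const char *path)
{
    char          line[256];
    unsigned long kb = 0;
    FILE         *fp = fopen(path, "r");

    if (fp == NULL)
        return 0;

    while (kb == 0 && fgets(line, sizeof line, fp) != NULL)
        kb = parse_hugepagesize_kb(line);

    fclose(fp);
    return kb * 1024UL;
}
```

A caller could use read_hugepagesize_bytes("/proc/meminfo") and fall back to 
the current 2MB value when it returns 0.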

Going further, we tried testing hugepages also on Ubuntu 16.04, also on the 
power8.  On Ubuntu Postgres did not like the hugepages at all (set also to 
16MB)  and consistently crashed.

Looking for some insight into this issue.  The error from the postgres log on 
Ubuntu is below.
It appears to be related to semaphores.

I don't have the compile options at the moment; I can provide those or other 
details as needed.

Reiner

2016-04-12 12:26:42 CEST : 0 FATAL:  semctl(7864340, 14, SETVAL, 0) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  server process (PID 13352) exited with exit 
code 1
2016-04-12 12:26:42 CEST : 0 LOG:  terminating any other active server processes
2016-04-12 12:26:42 CEST facturation:system_dba 0 10.32.32.200WARNING:  
terminating connection because of crash of another server process
2016-04-12 12:26:42 CEST facturation:system_dba 0 10.32.32.200DETAIL:  The 
postmaster has commanded this server process to roll back the current 
transaction and exit, because another server process exited abnormally and 
possibly corrupted shared memory.
2016-04-12 12:26:42 CEST facturation:system_dba 0 10.32.32.200HINT:  In a 
moment you should be able to reconnect to the database and repeat your command.
2016-04-12 12:26:42 CEST postgres:admin 0 10.32.16.3WARNING:  terminating 
connection because of crash of another server process
2016-04-12 12:26:42 CEST postgres:admin 0 10.32.16.3DETAIL:  The postmaster has 
commanded this server process to roll back the current transaction and exit, 
because another server process exited abnormally and possibly corrupted shared 
memory.
2016-04-12 12:26:42 CEST postgres:admin 0 10.32.16.3HINT:  In a moment you 
should be able to reconnect to the database and repeat your command.
2016-04-12 12:26:42 CEST postgres:perf_user 0 ::1WARNING:  terminating 
connection because of crash of another server process
2016-04-12 12:26:42 CEST postgres:perf_user 0 ::1DETAIL:  The postmaster has 
commanded this server process to roll back the current transaction and exit, 
because another server process exited abnormally and possibly corrupted shared 
memory.
2016-04-12 12:26:42 CEST postgres:perf_user 0 ::1HINT:  In a moment you should 
be able to reconnect to the database and repeat your command.
2016-04-12 12:26:42 CEST : 0 WARNING:  terminating connection because of crash 
of another server process
2016-04-12 12:26:42 CEST : 0 DETAIL:  The postmaster has commanded this server 
process to roll back the current transaction and exit, because another server 
process exited abnormally and possibly corrupted shared memory.
2016-04-12 12:26:42 CEST : 0 HINT:  In a moment you should be able to reconnect 
to the database and repeat your command.
2016-04-12 12:26:42 CEST : 0 LOG:  all server processes terminated; 
reinitializing
2016-04-12 12:26:42 CEST : 0 LOG:  could not remove shared memory segment 
"/PostgreSQL.1612071802": No such file or directory
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7274497, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7307267, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7340036, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7372805, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7405574, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7438343, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7471112, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7503881, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7536650, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7569419, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7602188, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7634957, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7667726, 0, IPC_RMID, ...) failed: 
Invalid argument
2016-04-12 12:26:42 CEST : 0 LOG:  semctl(7700495, 0, IPC_RMID, ...) f