Re: Serious regressions in planner.c 3.5 -> 3.5.1
On 3/18/19 12:20 PM, Ryan wrote:
> Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and
> most of it shows itself in the planner.
>
> Issue #1:
>
> Quick test, when ANY DLE entry is deleted, the planner will error out
> with the message:
>
>> planner: assert: ep->estimate[0].level != -1 is false: file planner.c,
>> line 1313
>> Aborted
>
> Just remove any entries from a healthy set and run this command to test:
>
>> /usr/libexec/amanda/planner $SET --starttime now --log-filename
>> /tmp/planner.log
FWIW, I cannot reproduce this. Are there any other error messages?
>
> Issue #2
> In other, less well-defined situations, the planner will seemingly get
> to the end of the estimates, then get stuck in some
> read/close/fstat/futex/mmap loop:
>
>> [pid 2463] close(120)                 = -1 EBADF (Bad
>> file descriptor)
>> [pid 2461] read(4,Â
>> [pid 2463] close(121
>> [pid 2461] <... read resumed> " 636/udp                 Â
>> Â Â "...,
>> 4096) = 4096
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(122
>> [pid 2461] close(4
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... close resumed> )      = 0
>> [pid 2463] close(123
>> [pid 2461] munmap(0x7fd5e73d5000, 4096
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(124
>> [pid 2461] <... munmap resumed> )     = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1
>> [pid 2463] close(125
>> [pid 2461] <... futex resumed> )      = 1
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(126
>> [pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120
>>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... write resumed> )      = 120
>> [pid 2463] close(127
>> [pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... open resumed> )       = 4
>> [pid 2463] close(128
>> [pid 2461] fstat(4,Â
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(129
>> [pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330,
>> ...}) = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
>> [pid 2463] close(130
>
> While this is going, the planner grows to consume over 100GBytes(!!!) of
> memory before the OOM killer stops it. The problem I'm able to
> reproduce right now is only logging a bit about ambind, which drew my
> attention but seems nonsensical as this was working until days ago with
> no server changes, and another set on the same system is still working
> just fine:
>
>> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5cc0017e0)
>> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner:
>> stream_client: Could not bind to port in range 512-1023.
>> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner:
>> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680
>> (BSDTCP) error=ambind: bind failed A: Permission denied)
>> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner:
>> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
>> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5dc001eb0)
>> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner:
>> stream_client: Could not bind to port in range 512-1023.
>
>
> By doing nothing else but calling the planner from 3.5 instead of 3.5.1
> (which needs a handful of 3.5 libs as well), both these issues are
> resolved. I tracked it down even more, and making the old planner.c use
> newer libs gives it the same problems. The single problematic lib is
> libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the
> issue, for whatever that's worth to anyone here.
>
> I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
>
I haven't seen the memory issue either.
--
Orion Poplawski
Manager of NWRA Technical Systems 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mit
Re: Serious regressions in planner.c 3.5 -> 3.5.1
Am 18.03.19 um 19:20 schrieb Ryan: > Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and > most of it shows itself in the planner. pls also inform [email protected] and file an issue at github: https://github.com/zmanda/amanda/issues Unfortunately the project activity has dropped to zero for months, all amanda-users would be happy to see fixes and development ... Stefan
Serious regressions in planner.c 3.5 -> 3.5.1
Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and
most of it shows itself in the planner.
Issue #1:
Quick test, when ANY DLE entry is deleted, the planner will error out
with the message:
planner: assert: ep->estimate[0].level != -1 is false: file planner.c,
line 1313
Aborted
Just remove any entries from a healthy set and run this command to test:
/usr/libexec/amanda/planner $SET --starttime now --log-filename
/tmp/planner.log
Issue #2
In other, less well-defined situations, the planner will seemingly get
to the end of the estimates, then get stuck in some
read/close/fstat/futex/mmap loop:
[pid 2463] close(120) = -1 EBADF (Bad file descriptor)
[pid 2461] read(4,
[pid 2463] close(121
[pid 2461] <... read resumed> " 636/udp "...,
4096) = 4096
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(122
[pid 2461] close(4
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... close resumed> ) = 0
[pid 2463] close(123
[pid 2461] munmap(0x7fd5e73d5000, 4096
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(124
[pid 2461] <... munmap resumed> ) = 0
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1
[pid 2463] close(125
[pid 2461] <... futex resumed> ) = 1
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(126
[pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... write resumed> ) = 120
[pid 2463] close(127
[pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] <... open resumed> ) = 4
[pid 2463] close(128
[pid 2461] fstat(4,
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2463] close(129
[pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330,
...}) = 0
[pid 2463] <... close resumed> ) = -1 EBADF (Bad file descriptor)
[pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
[pid 2463] close(130
While this is going, the planner grows to consume over 100GBytes(!!!) of
memory before the OOM killer stops it. The problem I'm able to
reproduce right now is only logging a bit about ambind, which drew my
attention but seems nonsensical as this was working until days ago with
no server changes, and another set on the same system is still working
just fine:
Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner:
security_stream_close(0x7fd5cc0017e0)
Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner:
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner:
stream_client: Could not bind to port in range 512-1023.
Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner:
security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680
(BSDTCP) error=ambind: bind failed A: Permission denied)
Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner:
security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner:
security_stream_close(0x7fd5dc001eb0)
Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner:
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner:
stream_client: Could not bind to port in range 512-1023.
By doing nothing else but calling the planner from 3.5 instead of 3.5.1
(which needs a handful of 3.5 libs as well), both these issues are
resolved. I tracked it down even more, and making the old planner.c use
newer libs gives it the same problems. The single problematic lib is
libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the
issue, for whatever that's worth to anyone here.
I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
