On 3/18/19 12:20 PM, Ryan wrote:
> Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and
> most of it shows itself in the planner.
>
> Issue #1:
>
> Quick test, when ANY DLE entry is deleted, the planner will error out
> with the message:
>
>> planner: assert: ep->estimate[0].level != -1 is false: file planner.c,
>> line 1313
>> Aborted
>
> Just remove any entries from a healthy set and run this command to test:
>
>> /usr/libexec/amanda/planner $SET --starttime now --log-filename
>> /tmp/planner.log
FWIW, I cannot reproduce this. Are there any other error messages?
>
> Issue #2
> In other, less well-defined situations, the planner will seemingly get
> to the end of the estimates, then get stuck in some
> read/close/fstat/futex/mmap loop:
>
>> [pid 2463] close(120)                 = -1 EBADF (Bad
>> file descriptor)
>> [pid 2461] read(4,Â
>> [pid 2463] close(121
>> [pid 2461] <... read resumed> " 636/udp                 Â
>> Â Â "...,
>> 4096) = 4096
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(122
>> [pid 2461] close(4
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... close resumed> )      = 0
>> [pid 2463] close(123
>> [pid 2461] munmap(0x7fd5e73d5000, 4096
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(124
>> [pid 2461] <... munmap resumed> )     = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1
>> [pid 2463] close(125
>> [pid 2461] <... futex resumed> )      = 1
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(126
>> [pid 2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120
>>
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... write resumed> )      = 120
>> [pid 2463] close(127
>> [pid 2461] open("/etc/services", O_RDONLY|O_CLOEXEC
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] <... open resumed> )       = 4
>> [pid 2463] close(128
>> [pid 2461] fstat(4,Â
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2463] close(129
>> [pid 2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330,
>> ...}) = 0
>> [pid 2463] <... close resumed> )      = -1 EBADF (Bad file
>> descriptor)
>> [pid 2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0
>> [pid 2463] close(130
>
> While this is going, the planner grows to consume over 100GBytes(!!!) of
> memory before the OOM killer stops it. The problem I'm able to
> reproduce right now is only logging a bit about ambind, which drew my
> attention but seems nonsensical as this was working until days ago with
> no server changes, and another set on the same system is still working
> just fine:
>
>> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5cc0017e0)
>> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner:
>> stream_client: Could not bind to port in range 512-1023.
>> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner:
>> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680
>> (BSDTCP) error=ambind: bind failed A: Permission denied)
>> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner:
>> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
>> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner:
>> security_stream_close(0x7fd5dc001eb0)
>> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner:
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner:
>> stream_client: Could not bind to port in range 512-1023.
>
>
> By doing nothing else but calling the planner from 3.5 instead of 3.5.1
> (which needs a handful of 3.5 libs as well), both these issues are
> resolved. I tracked it down even more, and making the old planner.c use
> newer libs gives it the same problems. The single problematic lib is
> libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the
> issue, for whatever that's worth to anyone here.
>
> I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
>
I haven't seen the memory issue either.
--
Orion Poplawski
Manager of NWRA Technical Systems 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mit