Re: Serious regressions in planner.c 3.5 -> 3.5.1

2019-03-21 Thread Orion Poplawski
On 3/18/19 12:20 PM, Ryan wrote:
> Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and 
> most of it shows itself in the planner.
> 
> Issue #1:
> 
> Quick test, when ANY DLE entry is deleted, the planner will error out 
> with the message:
> 
>> planner: assert: ep->estimate[0].level != -1 is false: file planner.c, 
>> line 1313
>> Aborted
> 
> Just remove any entries from a healthy set and run this command to test:
> 
>> /usr/libexec/amanda/planner $SET --starttime now --log-filename 
>> /tmp/planner.log

FWIW, I cannot reproduce this.  Are there any other error messages?

> 
> Issue #2
> In other, less well-defined situations, the planner will seemingly get 
> to the end of the estimates, then get stuck in some 
> read/close/fstat/futex/mmap loop:
> 
>> [pid  2463] close(120)                  = -1 EBADF (Bad 
>> file descriptor)
>> [pid  2461] read(4,  
>> [pid  2463] close(121 
>> [pid  2461] <... read resumed> " 636/udp                  
>>    "..., 
>> 4096) = 4096
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(122 
>> [pid  2461] close(4 
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... close resumed> )       = 0
>> [pid  2463] close(123 
>> [pid  2461] munmap(0x7fd5e73d5000, 4096 
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(124 
>> [pid  2461] <... munmap resumed> )      = 0
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 
>> [pid  2463] close(125 
>> [pid  2461] <... futex resumed> )       = 1
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(126 
>> [pid  2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120 
>> 
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... write resumed> )       = 120
>> [pid  2463] close(127 
>> [pid  2461] open("/etc/services", O_RDONLY|O_CLOEXEC 
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] <... open resumed> )        = 4
>> [pid  2463] close(128 
>> [pid  2461] fstat(4,  
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2463] close(129 
>> [pid  2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330, 
>> ...}) = 0
>> [pid  2463] <... close resumed> )       = -1 EBADF (Bad file 
>> descriptor)
>> [pid  2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 
>> [pid  2463] close(130 
> 
> While this is going, the planner grows to consume over 100GBytes(!!!) of 
> memory before the OOM killer stops it.  The problem I'm able to 
> reproduce right now is only logging a bit about ambind, which drew my 
> attention but seems nonsensical as this was working until days ago with 
> no server changes, and another set on the same system is still working 
> just fine:
> 
>> Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner: 
>> security_stream_close(0x7fd5cc0017e0)
>> Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner: 
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner: 
>> stream_client: Could not bind to port in range 512-1023.
>> Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner: 
>> security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 
>> (BSDTCP) error=ambind: bind failed A: Permission denied)
>> Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner: 
>> security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
>> Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner: 
>> security_stream_close(0x7fd5dc001eb0)
>> Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner: 
>> ambind failed: ambind: bind failed A: Permission denied
>> Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner: 
>> stream_client: Could not bind to port in range 512-1023.
> 
> 
> By doing nothing else but calling the planner from 3.5 instead of 3.5.1 
> (which needs a handful of 3.5 libs as well), both these issues are 
> resolved.  I tracked it down even more, and making the old planner.c use 
> newer libs gives it the same problems.  The single problematic lib is 
> libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the 
> issue, for whatever that's worth to anyone here.
> 
> I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...
> 

I haven't seen the memory issue either.

-- 
Orion Poplawski
Manager of NWRA Technical Systems  720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mit

Re: Serious regressions in planner.c 3.5 -> 3.5.1

2019-03-19 Thread Stefan G. Weichinger
Am 18.03.19 um 19:20 schrieb Ryan:
> Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and
> most of it shows itself in the planner.

pls also inform

[email protected]

and file an issue at github:

https://github.com/zmanda/amanda/issues

Unfortunately the project activity has dropped to zero for months, all
amanda-users would be happy to see fixes and development ...

Stefan


Serious regressions in planner.c 3.5 -> 3.5.1

2019-03-18 Thread Ryan
Amanda 3.5.1 (RHEL7 community RPM) has been terribly buggy for us, and 
most of it shows itself in the planner.


Issue #1:

Quick test, when ANY DLE entry is deleted, the planner will error out 
with the message:


planner: assert: ep->estimate[0].level != -1 is false: file planner.c, 
line 1313

Aborted


Just remove any entries from a healthy set and run this command to test:

/usr/libexec/amanda/planner $SET --starttime now --log-filename 
/tmp/planner.log



Issue #2
In other, less well-defined situations, the planner will seemingly get 
to the end of the estimates, then get stuck in some 
read/close/fstat/futex/mmap loop:



[pid  2463] close(120)  = -1 EBADF (Bad file descriptor)
[pid  2461] read(4,  
[pid  2463] close(121 
[pid  2461] <... read resumed> " 636/udp "..., 
4096) = 4096

[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2463] close(122 
[pid  2461] close(4 
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2461] <... close resumed> )   = 0
[pid  2463] close(123 
[pid  2461] munmap(0x7fd5e73d5000, 4096 
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2463] close(124 
[pid  2461] <... munmap resumed> )  = 0
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2461] futex(0x7fd5e0cb9240, FUTEX_WAKE_PRIVATE, 1 
[pid  2463] close(125 
[pid  2461] <... futex resumed> )   = 1
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2463] close(126 
[pid  2461] write(10, "Mon Mar 18 10:43:07.568314681 20"..., 120 


[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2461] <... write resumed> )   = 120
[pid  2463] close(127 
[pid  2461] open("/etc/services", O_RDONLY|O_CLOEXEC 
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2461] <... open resumed> )    = 4
[pid  2463] close(128 
[pid  2461] fstat(4,  
[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2463] close(129 
[pid  2461] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=670330, 
...}) = 0

[pid  2463] <... close resumed> )   = -1 EBADF (Bad file descriptor)
[pid  2461] mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 

[pid  2463] close(130 


While this is going, the planner grows to consume over 100GBytes(!!!) of 
memory before the OOM killer stops it.  The problem I'm able to 
reproduce right now is only logging a bit about ambind, which drew my 
attention but seems nonsensical as this was working until days ago with 
no server changes, and another set on the same system is still working 
just fine:


Mon Mar 18 10:43:02.454918586 2019: pid 2453: thd-0x14ff600: planner: 
security_stream_close(0x7fd5cc0017e0)
Mon Mar 18 10:43:02.499910629 2019: pid 2453: thd-0x1682000: planner: 
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.500014522 2019: pid 2453: thd-0x1682000: planner: 
stream_client: Could not bind to port in range 512-1023.
Mon Mar 18 10:43:02.500092361 2019: pid 2453: thd-0x1682000: planner: 
security_seterror(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 
(BSDTCP) error=ambind: bind failed A: Permission denied)
Mon Mar 18 10:43:02.500372813 2019: pid 2453: thd-0x14ff600: planner: 
security_close(handle=0x7fd5dc001500, driver=0x7fd5e6f32680 (BSDTCP))
Mon Mar 18 10:43:02.500754962 2019: pid 2453: thd-0x14ff600: planner: 
security_stream_close(0x7fd5dc001eb0)
Mon Mar 18 10:43:02.519951961 2019: pid 2453: thd-0x1682050: planner: 
ambind failed: ambind: bind failed A: Permission denied
Mon Mar 18 10:43:02.520019407 2019: pid 2453: thd-0x1682050: planner: 
stream_client: Could not bind to port in range 512-1023.



By doing nothing else but calling the planner from 3.5 instead of 3.5.1 
(which needs a handful of 3.5 libs as well), both these issues are 
resolved.  I tracked it down even more, and making the old planner.c use 
newer libs gives it the same problems.  The single problematic lib is 
libamserver-3.5.1.so... Changing just that lib back to 3.5 fixes the 
issue, for whatever that's worth to anyone here.


I'm thinking a complete downgrade from 3.5.1 to 3.5 is called for...