Hi again. Guys I need some hints from you, at least personally, so please take a look at my first message if possible. Thanks in advance and have a nice day.
2018-07-16 10:34 GMT+03:00 Donat Zenichev <[email protected]>: > Hi community, recently I've got the problems with sudden termination of > the daemon. > > This bunch of similar rows seems to be some timer job that takes longer > than it was supposed to be: > Jul 14 02:13:10 some-machine opensips[25060]: message repeated 3 times: [ > WARNING:core:timer_ticker: timer task <nh-timer> already schedualed for > 10040133060 ms (now 10040134050 ms), it may overlap..] > Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404100 ms (now > 10063404200 ms), it may overlap.. > Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404200 ms (now > 10063404300 ms), it may overlap.. > Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404300 ms (now > 10063404400 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404400 ms (now > 10063404500 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404500 ms (now > 10063404600 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404600 ms (now > 10063404700 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404700 ms (now > 10063404800 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404800 ms (now > 10063404900 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063404900 ms (now > 10063405010 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063405010 ms (now > 10063405110 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25077]: CRITICAL:core:receive_fd: > EOF on 41 > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063405110 ms (now > 10063405210 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063405210 ms (now > 10063405310 ms), it may overlap.. > Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063405310 ms (now > 10063405410 ms), it may overlap.. > Jul 14 08:45:31 some-machine opensips[25060]: WARNING:core:utimer_ticker: > utimer task <tm-utimer> already schedualed for 10063405410 ms (now > 10063405510 ms), it may overlap.. > > Then we have the following row, that means that child process 25075 was > killed by SIGSEGV signal (probably this process tried to appeal to > non-existent memory cell?). > Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: child > process 25075 exited by a signal 11 > > I can see in the logs lots of rows about the process 25075 that was > treating the nathelper module: > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send: > send() for proto 2 failed > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer: > sip msg_send failed > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send: > send() for proto 2 failed > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer: > sip msg_send failed > > Then we have a core dump generated > Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: core > was generated > > Then we have a SIGCHLD > Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: > terminating due to SIGCHLD > > I'm not sure how this is considered by opensips daemon (afaik a root > process can ignore such requests) but seems that it caused a termination of > the root process as well. > Because just after that we see, that a SIGTERM (killing) for all the > forked processes had started: > Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: > terminating due to SIGCHLD > Jul 14 08:45:31 some-machine opensips[25074]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25077]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25070]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25064]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25072]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25062]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25060]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25056]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25067]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25065]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25061]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25063]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25066]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25068]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25076]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25069]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25071]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25073]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25058]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25059]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25057]: INFO:core:sig_usr: signal 15 > received > Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:cleanup: cleanup > > And it's bizarre, since just one killed sub-process triggered such general > termination. > > Since we see that, the process 25075 was related to nathelper module, the > problem perhaps is around that. > I've read several bug articles on the similar issues, and one of the > answers (from Razvan Crainea) seems to be relevant to our story: > > "OpenSIPS pings all users in a single timer run. If there are a lot of > users, this might take a lot, especially if users are using TCP, and > some of them have a stale connection. > Increasing the natping_partitions partitions will basically split all > the users in smaller sets, and will spread them among multiple job runs, > thus multiple processes. Therefore increasing this parameter will > definitely help you. > The range is not a magic value . It depends on the amount of users, > bandwidth, etc. It makes sense to set this parameter from 1 to the > maximum number of processes provisioned - anything else will probably > make timer jobs overlap. " > > For now we do have this parameter set to: > modparam("nathelper", "natping_partitions", 4) > > So it performs nat pinging separately, each part has 1/4 part of users. > > Actually it seems pretty enough to work properly, but seems that nathelper > accumulates the "dead" users (that are not available anymore) and tries to > ping them. But it's just my assumption, because we have to many rows with > the same error (not only this time and with this child process, they are > spread all a log), I mean like that: > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send: > send() for proto 2 failed > Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer: > sip msg_send failed > > And parameters that can have an impact on this situation: > modparam("nathelper", "natping_interval", 30) > modparam("nathelper", "natping_partitions", 4) > > My idea is to increase the partitions, probably to value "8" ? > And probably decrease the interval to 10 seconds. > > What would you advice? > > -- > -- > BR, Donat Zenichev > > -- -- BR, Donat Zenichev
_______________________________________________ Users mailing list [email protected] http://lists.opensips.org/cgi-bin/mailman/listinfo/users
