> A) Is the LRN database located locally on the OpenSIPs box or is it remote?
We are using an F5 BIG-IP to proxy a pool of database servers. OpenSIPS is showing two connection-related errors:

Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: ERROR:db_mysql:db_mysql_connect: driver error(2013): Lost connection to MySQL server at 'reading authorization packet', system error: 110
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: ERROR:db_mysql:db_mysql_new_connection: initial connect failed
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: ERROR:core:db_init_async: failed to open new DB connection on mysql://XXXX:[email protected]:0/
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection (current: 1 + 8). Running in sync mode!
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: INFO:db_mysql:switch_state_to_disconnected: disconnect event for 0x7f8903f16d10
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: INFO:db_mysql:reset_all_statements: resetting all statements on connection: (0x7f8903f16bb0) 0x7f8903f16d10
Jun 4 10:41:48 TC-521 /usr/sbin/opensips[12318]: INFO:db_mysql:connect_with_retry: re-connected successful for 0x7f8903f16d10

Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: ERROR:db_mysql:db_mysql_connect: driver error(2003): Can't connect to MySQL server on '10.0.5.38' (110)
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: ERROR:db_mysql:db_mysql_new_connection: initial connect failed
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: ERROR:core:db_init_async: failed to open new DB connection on mysql://XXXX:[email protected]:0/
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: INFO:db_mysql:db_mysql_async_raw_query: Failed to open new connection (current: 1 + 10). Running in sync mode!
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: INFO:db_mysql:switch_state_to_disconnected: disconnect event for 0x7f8903f16d10
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: INFO:db_mysql:reset_all_statements: resetting all statements on connection: (0x7f8903f16bb0) 0x7f8903f16d10
Jun 4 10:44:29 TC-521 /usr/sbin/opensips[12342]: INFO:db_mysql:connect_with_retry: re-connected successful for 0x7f8903f16d10

MariaDB is also showing an error from its perspective:

2020-06-04 23:40:27 64783 [Warning] Aborted connection 64783 to db: 'unconnected' user: 'anonymous' host: '8.38.42.13' (Got timeout reading communication packets)

> B) Have you tried only doing sync database queries? Async introduces some
> overhead, and I'm not sure if it causes extra database connections to be
> created. When using sync there is a connection per child process that stays
> up.
Using synchronous mode appeared to be causing context switching issues under heavy load. We specifically moved to async for this reason and that appeared to reduce the CPU load dramatically. From the docs:

"Using the asynchronous, "suspend-resume" logic instead of forking a large number of processes in order to scale also has the advantage of optimizing system resource usage, increasing its maximal throughput. By requiring less processes to complete the same amount of work in the same amount of time, process context switching is minimized and overall CPU usage is improved. Less processes will also eat up less system memory."

I've been tweaking each of the configuration settings I've mentioned, but without any clear path forward. Would 3.x provide any solutions? Is it possible to have too many children or timer partitions, and starve OpenSIPS with context switches? Would that cause connection issues?

> C) Does the database have enough memory to contain the LRN and DNC datasets
> fully in memory? The extra latency for the non-cache hits sent to the
> database may stack up if the database has to hit disk.
The DB reports query response times of about 0.001s and doesn't show any sign of strain. I'm not personally familiar with the TokuDB engine, but I'm led to believe the entire dataset is in memory. I have two DBAs triple-checking things. It's possible we're hitting a max connections or open files limit that's set too low. Sometimes our peak hours include spikes as well.

> D) How many child processes are you using now? If you are hitting 100% you
> may need to increase them.
Only one hits 100% initially, then the others topple over after that. This seems to be related to the intermittent database connection errors. We'll see what raising the max connections and ulimits on the server does. I've also backed off on children and increased the async connection pool size so that the total maximum number of connections stays the same. Presumably this will reduce context switches and timer delays.

> E) Are your memcached processes using heavy cpu? If you are caching multiple
> lists, I've found it helps to use unique memcached instance per list.
All of the various SIP dips are the same DB stored procedure with many fields in the response. Those fields are cached as a CSV string, so any cached dip can be used by any other kind of dip. The same call is likely to use multiple dips, so we should only hit the DB once per call regardless of how many different dips we apply.

> F) Look for memory related log messages. If the memory starts getting
> exhausted you will see defrag messages. This will chew up available
> computation cycles.
Both OpenSIPS servers and the database have plenty of free memory. How do I know how much shared and process memory to use? I see warnings about the reactor size shrinking to a percentage of the process memory but have no idea what that implies.
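
For the change described under D (fewer children, a larger async pool, the same total), the arithmetic can be sketched roughly as below. This is a minimal Python sketch, assuming each worker keeps one synchronous connection plus up to its async pool size, which is one reading of the "current: 1 + 8" and "current: 1 + 10" log lines. The function names, the headroom factor, and every number are hypothetical and not taken from this deployment.

```python
def max_db_connections(children: int, async_pool: int, sync_per_child: int = 1) -> int:
    """Upper bound on simultaneous DB connections opened by all workers.

    Assumption (not confirmed by the source): every worker holds
    `sync_per_child` persistent connections plus up to `async_pool`
    async connections.
    """
    if children < 0 or async_pool < 0 or sync_per_child < 0:
        raise ValueError("counts must be non-negative")
    return children * (sync_per_child + async_pool)


def fits_db_limit(children: int, async_pool: int, db_limit: int,
                  headroom: float = 0.2) -> bool:
    """True if the worst case stays under the DB limit minus a safety margin.

    `db_limit` stands in for the server-side connection cap; `headroom`
    reserves a fraction of it for other clients (hypothetical policy).
    """
    return max_db_connections(children, async_pool) <= db_limit * (1 - headroom)


if __name__ == "__main__":
    # Hypothetical before/after: half the children, bigger pool, same total.
    before = max_db_connections(children=32, async_pool=10)
    after = max_db_connections(children=16, async_pool=21)
    print(before, after)
    print(fits_db_limit(16, 21, db_limit=500))
```

The point of the sketch is only that the worst-case total scales with children times pool size, so trading one for the other leaves the database-side requirement unchanged while reducing the number of processes competing for CPU.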
