syrcx created an issue (kamailio/kamailio#4433)

### Description
We observe intermittent Redis failures in `ndb_redis` when using Redis Cluster 
with TLS and password (AWS ElastiCache). Errors often appear right after a 
MOVED redirection and during reconnect:

- “Resource temporarily unavailable” while reconnecting to the redirected node
- Subsequent “MOVED … host:port”
- The command fails (e.g., GET)

>From reading the module code, there seem to be two distinct issues:

- In cluster mode with `allow_dynamic_nodes=1`, when handling MOVED/ASK, 
`ndb_redis` adds a new server entry including only `name/addr/port`. It does 
not inherit `pass`, `tls`, `db`, or timeouts, so redirected connections are 
attempted without TLS and without AUTH.
- The `tls` parameter parsing overwrites the `pass` buffer, corrupting a 
previously parsed password when `tls=1` is present.

This explains the intermittent nature: only keys whose slots are redirected to 
“new” nodes (created dynamically without TLS/auth) fail.

### Environment
- Kamailio version: ghcr.io/sergey-safarov/kamailio-ci:master-alpine, which is 
from https://github.com/kamailio/kamailio/issues/4401#issuecomment-3352145843
- Redis: AWS ElastiCache Redis Cluster with TLS and AUTH

### Configuration (excerpt)
```cfg
loadmodule "ndb_redis.so"

modparam("ndb_redis", "init_without_redis", 1)
modparam("ndb_redis", "connect_timeout", 15000)
modparam("ndb_redis", "cmd_timeout", 5000)
modparam("ndb_redis", "cluster", 1)
modparam("ndb_redis", "allow_dynamic_nodes", 1)
modparam("ndb_redis", "ca_path", "/etc/ssl/certs")
modparam("ndb_redis", "server", "name=con1;addr=...;port=6379;pass=...;tls=1")
```

### Steps to Reproduce
1) Configure `ndb_redis` for Redis Cluster with TLS and password (as above), 
set `cluster=1` and `allow_dynamic_nodes=1`.
2) Execute commands across keys that map to different hash slots (to trigger 
MOVED).
3) Observe that some redirected connections report “Resource temporarily 
unavailable” and the command fails.

### Actual Behavior
Logs (representative):
```
ERROR: ndb_redis [redis_client.c:677]: redisc_reconnect_server(): error 
communicating with redis server 
[shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379]
 
(shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379/0):
 Resource temporarily unavailable
ERROR: ndb_redis [redis_client.c:686]: redisc_reconnect_server(): failed to 
connect to redis server 
[shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379]
 
(shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379/0)
ERROR: ndb_redis [redis_client.c:998]: check_cluster_reply(): failed connecting 
to the new server with name: 
shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379
ERROR: ndb_redis [redis_client.c:1143]: redisc_exec(): Redis error:MOVED 10021 
shared-us-east-1-prod-0001-001.shared-us-east-1-prod.qonwyy.use1.cache.amazonaws.com:6379
ERROR: <script>: Redis GET failed for key bye_headers:...
```

### Expected Behavior
MOVED redirections should connect using the same TLS, AUTH, DB, and timeout 
settings as the original server; the command should then be retried 
successfully.

### Code references

- MOVED handling creates new node specs without TLS/AUTH/DB/timeouts, then 
attempts to connect:
```911:1021:kamailio/src/modules/ndb_redis/redis_client.c
int check_cluster_reply(redisReply *reply, redisc_server_t **rsrv)
{
        ...
                if((reply->len > 7) && (strncmp(reply->str, "MOVED", 5) == 0)) {
                        ...
                        rsrv_new = redisc_get_server(&name);
                        if(rsrv_new) {
                                *rsrv = rsrv_new;
                                return 1;
                        } else if(redis_allow_dynamic_nodes_param) {
                                ...
                                server_len = snprintf(spec_new, 
sizeof(spec_new) - 1,
                                                "name=%.*s;addr=%.*s;port=%i", 
name.len, name.s,
                                                addr.len, addr.s, port);
                                ...
                                if(redisc_add_server(server_new) == 0) {
                                        rsrv_new = redisc_get_server(&name);
                                        if(rsrv_new) {
                                                *rsrv = rsrv_new;
                                                /* Need to connect to the new 
server now */
                                                
if(redisc_reconnect_server(rsrv_new) == 0) {
                                                        ...
                                                        return 1;
                                                } else {
                                                        LM_ERR("failed 
connecting to the new server with "
                                                                   "name: 
%.*s\n", name.len, name.s);
                                                        return 0;
                                                }
```

- TLS parameter overwrites `pass` buffer (init):
```154:160:kamailio/src/modules/ndb_redis/redis_client.c
} else if(pit->name.len == 3
                  && strncmp(pit->name.s, "tls", 3) == 0) {
        snprintf(pass, sizeof(pass) - 1, "%.*s", pit->body.len,
                        pit->body.s);
        if(str2int(&pit->body, &enable_ssl) < 0)
                enable_ssl = 0;
```

- Same bug in reconnect:
```526:531:kamailio/src/modules/ndb_redis/redis_client.c
} else if(pit->name.len == 3 && strncmp(pit->name.s, "tls", 3) == 0) {
        snprintf(pass, sizeof(pass) - 1, "%.*s", pit->body.len, pit->body.s);
        if(str2int(&pit->body, &enable_ssl) < 0)
                enable_ssl = 0;
```

- Reconnect error reporting path (where “Resource temporarily unavailable” is 
logged):
```670:681:kamailio/src/modules/ndb_redis/redis_client.c
err2:
        if(sock != 0) {
                LM_ERR("error communicating with redis server [%.*s]"
                           " (unix:%s db:%d): %s\n",
                                rsrv->sname->len, rsrv->sname->s, 
unix_sock_path, db,
                                rsrv->ctxRedis->errstr);
        } else {
                LM_ERR("error communicating with redis server [%.*s] 
(%s:%d/%d): %s\n",
                                rsrv->sname->len, rsrv->sname->s, addr, port, 
db,
                                rsrv->ctxRedis->errstr);
        }
```

- Minor robustness issue: sentinel master address copy uses `len + 1`:
```579:583:kamailio/src/modules/ndb_redis/redis_client.c
strncpy(addr, res->element[0]->str, res->element[0]->len + 1);
```

### Proposed fixes
- When adding dynamic nodes after MOVED/ASK, include inherited settings in 
`spec_new` (at least `pass`, `db`, `tls`, `connect_timeout`, `command_timeout`) 
or duplicate/copy the parent server’s `attrs`.
- In both init and reconnect, do not write to `pass` in the `tls` parsing 
branch; only parse `enable_ssl`.
- Call `redisSetTimeout()` immediately after creating the hiredis context and 
before `AUTH`/`PING`/`SELECT`.
- Bound the sentinel master address copy with a proper length and explicit NUL.


-- 
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/4433
You are receiving this because you are subscribed to this thread.

Message ID: <kamailio/kamailio/issues/[email protected]>
_______________________________________________
Kamailio - Development Mailing List -- [email protected]
To unsubscribe send an email to [email protected]
Important: keep the mailing list in the recipients, do not reply only to the 
sender!

Reply via email to