As mentioned in BUG-624: https://issues.apache.org/jira/browse/ZOOKEEPER-624
The C client dumps core when it receives malformed data from the ZooKeeper
server, and the fix for that bug appears to be incomplete. The gdb
backtraces are:
do_io thread:
#0 0x00000039fb030265 in raise () from /lib64/libc.so.6
#1 0x00000039fb031d10 in abort () from /lib64/libc.so.6
#2 0x00000039fb06a84b in __libc_message () from /lib64/libc.so.6
#3 0x00000039fb0722ef in _int_free () from /lib64/libc.so.6
#4 0x00000039fb07273b in free () from /lib64/libc.so.6
#5 0x00002b0afd755dd1 in deallocate_String (s=0x5a490f40) at src/recordio.c:29
#6 0x00002b0afd754ade in zookeeper_process (zh=0x131e3870, events=<value optimized out>) at src/zookeeper.c:2071
#7 0x00002b0afd75b2ef in do_io (v=<value optimized out>) at src/mt_adaptor.c:310
#8 0x00000039fb8064a7 in start_thread () from /lib64/libpthread.so.0
#9 0x00000039fb0d3c2d in clone () from /lib64/libc.so.6
create_node thread:
#0 0x00000039fb80ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b0afd75af5c in wait_sync_completion (sc=0x131e4c90) at src/mt_adaptor.c:82
#2 0x00002b0afd751750 in zoo_create (zh=0x131e3870, path=0x13206fa8
"/jsq/zr2/hb/10.250.8.139:8102",
value=0x131e86a8
"\n\021\061\060.250.8.139:8102\022\035/home/shaoqiang/workdir2/qrs/\030\001
\001*%\n\020\n",
valuelen=102, acl=0x2b0afd961700, flags=1, path_buffer=0x0,
path_buffer_len=0) at src/zookeeper.c:3028
The relevant source in zookeeper.c:
    case COMPLETION_STRING:
        LOG_DEBUG(("Calling COMPLETION_STRING for xid=%#x rc=%d",
                   cptr->xid, rc));
        if (rc == 0) {
            struct CreateResponse res;
            int len;
            deserialize_CreateResponse(ia, "reply", &res);
            len = strlen(res.path) + 1;
            if (len > sc->u.str.str_len) {
                len = sc->u.str.str_len;
            }
            if (len > 0) {
                memcpy(sc->u.str.str, res.path, len - 1);
                sc->u.str.str[len - 1] = '\0';
            }
            deallocate_CreateResponse(&res); /* this call causes the core dump */
        }
        break;
The relevant source in recordio.c:
    int ia_deserialize_string(struct iarchive *ia, const char *name, char **s)
    {
        struct buff_struct *priv = ia->priv;
        int32_t len;
        int rc = ia_deserialize_int(ia, "len", &len);
        if (rc < 0)
            return rc;
        if ((priv->len - priv->off) < len) {
            return -E2BIG;
        }
        if (len < 0) {
            return -EINVAL;
        }
        *s = malloc(len+1);
        if (!*s) {
            return -ENOMEM;
        }
        memcpy(*s, priv->buffer+priv->off, len);
        (*s)[len] = '\0';
        priv->off += len;
        return 0;
    }
The variable len is set by ia_deserialize_int, and the value it returned was
-1, so the function returns -EINVAL and *s = malloc(len+1) is never
executed. (Why did the server return -1? It should be the length of the path
the client just created. If the create operation failed on the server, the
error code returned by the server should be nonzero, but the error code in
the reply header is actually zero.) The caller ignores the return value of
deserialize_CreateResponse, so res.path is left uninitialized, and
deallocate_CreateResponse then tries to free it.
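One possible client-side guard against the failure described above is sketched
here. It is not the actual ZooKeeper patch: demo_read_string, demo_response and
demo_handle_create are hypothetical stand-ins for ia_deserialize_string,
struct CreateResponse and the COMPLETION_STRING branch, and the wire format is
reduced to an explicit length argument. The sketch assumes the intended
behaviour is to set the output pointer to NULL before any check can fail, to
test the return code before using the string, and to rely on free(NULL) being
a no-op.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct CreateResponse. */
struct demo_response {
    char *path;
};

/* Same checks as ia_deserialize_string, but *s gets a defined value (NULL)
 * before any early return, so the caller never sees an indeterminate
 * pointer. 'avail' is the number of bytes left in the buffer and 'len' is
 * the length field read from the wire. */
static int demo_read_string(const char *payload, int32_t avail, int32_t len,
                            char **s)
{
    assert(s != NULL);
    *s = NULL;
    if (len < 0)
        return -EINVAL;
    if (avail < len)
        return -E2BIG;
    *s = malloc((size_t)len + 1);
    if (*s == NULL)
        return -ENOMEM;
    memcpy(*s, payload, (size_t)len);
    (*s)[len] = '\0';
    return 0;
}

/* Hypothetical version of the COMPLETION_STRING branch: the path is copied
 * (truncated to out_len - 1 characters) only when deserialization succeeded.
 * On failure 'out' is left as an empty string and the error is returned. */
static int demo_handle_create(const char *payload, int32_t avail, int32_t len,
                              char *out, int out_len)
{
    struct demo_response res = { NULL };
    int rc;

    if (out_len > 0)
        out[0] = '\0';
    rc = demo_read_string(payload, avail, len, &res.path);
    if (rc == 0 && out_len > 0) {
        size_t n = strlen(res.path);
        if (n > (size_t)out_len - 1)
            n = (size_t)out_len - 1;
        memcpy(out, res.path, n);
        out[n] = '\0';
    }
    free(res.path); /* free(NULL) is a no-op, so this is safe on failure */
    return rc;
}
```

With a length of -1, as seen in the core dump, demo_handle_create returns
-EINVAL and never passes an uninitialized pointer to strlen or free.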
The ZooKeeper server also seems to have a bug. In DataTree.java, in the
function public ProcessTxnResult processTxn(TxnHeader header, Record txn):
    try {
        rc.clientId = header.getClientId();
        rc.cxid = header.getCxid();
        rc.zxid = header.getZxid();
        rc.type = header.getType();
        rc.err = 0;
        if (rc.zxid > lastProcessedZxid) {
            lastProcessedZxid = rc.zxid;
        }
        switch (header.getType()) {
        case OpCode.create:
            CreateTxn createTxn = (CreateTxn) txn;
            debug = "Create transaction for " + createTxn.getPath();
            createNode(
                    createTxn.getPath(),
                    createTxn.getData(),
                    createTxn.getAcl(),
                    createTxn.getEphemeral() ? header.getClientId() : 0,
                    header.getZxid(), header.getTime());
            rc.path = createTxn.getPath();
            break;
What if createNode throws an exception? The operation did not succeed, but
rc.err is not changed: it was set to zero before any work was actually
done.
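The ordering problem can be modeled in a few lines. This is only a sketch, and
it is written in C rather than the server's Java; demo_txn_result,
demo_create_node and demo_process_create are hypothetical names, and the
failure condition (a path without a leading '/') is invented for the example.
It illustrates one way to keep the preset zero from reaching the caller:
record the failure in the result and leave the path unset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ProcessTxnResult, reduced to two fields. */
struct demo_txn_result {
    int err;          /* 0 on success, nonzero on failure */
    const char *path; /* set only when the create succeeded */
};

/* Hypothetical stand-in for createNode: fails (returns -1) when the path
 * does not start with '/'. */
static int demo_create_node(const char *path)
{
    return (path[0] == '/') ? 0 : -1;
}

/* err starts at 0, as in processTxn, but a failed create overwrites it so
 * the reply cannot report success with no path attached. */
static struct demo_txn_result demo_process_create(const char *path)
{
    struct demo_txn_result rc = { 0, NULL };
    int status;

    assert(path != NULL);
    status = demo_create_node(path);
    if (status != 0) {
        rc.err = status; /* propagate the failure instead of leaving 0 */
        return rc;
    }
    rc.path = path;
    return rc;
}
```

A caller that sees err == 0 can then rely on path being set, which is the
invariant the C client assumed when it dereferenced res.path.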
By the way, this core dump is hard to reproduce, and I guess a bad
network may be one of the causes.