On Tue, Jul 24, 2018 at 10:17 AM Martin Cigorraga <mailing.lists.forwar...@gmail.com> wrote:
> Hi all,
>
> I'm not a DS guy but an Ops guy, fairly new to Zookeeper and
> Kafka; I've already spent a ton of time trying to understand what's
> going on here, but so far I've only scratched the surface of the issue
> without reaching a complete understanding of it.
>
> The issue:
> We have a Kafka cluster with a Zookeeper coordinating them all; the
> Zookeeper box log file (/var/log/zookeeper/zookeeper.out) is being
> filled with this error:
>
> 2018-07-24 08:23:43,178 [myid:3] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] -
> Exception causing close of session 0x3000017d522019e: Len error
> 1119502

Hi Martin.

ZK has a configured limit for message lengths, and it is being exceeded
for some reason on this session. Likely Kafka is sending too large a
dataset, or perhaps you have too many child znodes; unfortunately both
of those are Kafka-specific (Kafka acting as a ZK client) issues that I
don't have insight into. See the ZK admin guide for more on the config;
jute.maxbuffer is the setting here:

http://zookeeper.apache.org/doc/r3.4.13/zookeeperAdmin.html#Unsafe+Options

Notice you are exceeding that limit: "1119502".
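For illustration, here is a minimal sketch (in Python, not ZooKeeper's actual Java code) of the check behind that "Len error" message: the server reads a 4-byte big-endian length prefix from each request frame and closes the session if it exceeds jute.maxbuffer, whose default is 0xfffff (1048575 bytes, just under 1 MB):

```python
import struct

# ZooKeeper's default jute.maxbuffer: 0xfffff = 1048575 bytes (~1 MB).
JUTE_MAXBUFFER = 0xFFFFF

def check_request_length(frame, limit=JUTE_MAXBUFFER):
    """Read the 4-byte big-endian length prefix of a request frame and
    reject it if it exceeds the configured limit (the 'Len error' case)."""
    (length,) = struct.unpack(">i", frame[:4])
    if length < 0 or length > limit:
        raise IOError("Len error %d" % length)
    return length
```

In your log, 1119502 > 1048575, so the frame is rejected and the session closed. Raising jute.maxbuffer (via -Djute.maxbuffer, on both servers and clients) works around it, but the admin guide lists it under "Unsafe Options" for a reason; it's better to find out what Kafka is stuffing into that request.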
> 2018-07-24 08:23:43,178 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] -
> Closed socket connection for client /10.20.3.16:40618 which had
> sessionid 0x3000017d522019e
> 2018-07-24 08:23:43,492 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.20.2.24:49410
> 2018-07-24 08:23:43,492 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@931] -
> Client attempting to renew session 0x3000017d52200b1 at
> /10.20.2.24:49410
> 2018-07-24 08:23:43,493 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@683] -
> Established session 0x3000017d52200b1 with negotiated timeout 30000
> for client /10.20.2.24:49410
>
> //
>
> The IP 10.20.2.24 refers to the box where we run Kafka Manager and
> Kafka Offset Monitor. Looking into the Kafka Offset Monitor logs I can
> see:
>
> 09:06:56 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:06:58 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:06:59 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:03 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:07 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:07 INFO ClientCnxn:852 - Socket connection established to
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181, initiating session
> 09:07:17 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:17 INFO ClientCnxn:852 - Socket connection established to
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181, initiating session
> 09:07:17 INFO ClientCnxn:1235 - Session establishment complete on
> server ip-10-20-1-6.ec2.internal/10.20.1.6:2181, sessionid =
> 0x20000178e5d0083, negotiated timeout = 30000

Are you using auth, e.g. Kerberos? Basically the client attempts to
connect using SASL; if the attempt fails it retries without it (and as
a result without auth).

> With 10.20.1.6 being the Zookeeper box. There's definitely something
> fishy going on here but I can't untangle what it is.
>
> I'd thankfully appreciate it if you guys could point me in the right
> direction to learn more about these errors.

You might check the Kafka docs to see if they talk about the
jute.maxbuffer issue. I'm assuming they see this on larger deployments
and must discuss it somewhere.

Regards,

Patrick

> Thanks!
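P.S. The SASL fallback behind those "Will not attempt to authenticate using SASL (unknown error)" lines can be sketched roughly like this (a Python sketch with hypothetical helper names, not the real Java ClientCnxn code): the client tries to obtain SASL credentials first, and if that fails it logs the warning and proceeds with an unauthenticated connection rather than aborting.

```python
def establish_session(connect, sasl_login):
    """Hypothetical sketch of the ZK client's SASL fallback:
    'connect' and 'sasl_login' stand in for the real client internals.
    A failed SASL setup downgrades to an unauthenticated session."""
    try:
        creds = sasl_login()  # e.g. Kerberos login in the real client
    except Exception as err:
        # Mirrors the log line you are seeing; the session still proceeds.
        print("Will not attempt to authenticate using SASL (%s)" % err)
        creds = None
    return connect(creds)  # session is established either way
```

So those lines are harmless noise if you don't intend to use SASL; they just mean no auth is in play on that connection.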