On Tue, Jul 24, 2018 at 10:17 AM Martin Cigorraga <mailing.lists.forwar...@gmail.com> wrote:
> Hi all,
>
> I'm not a DS guy but an Ops guy, fairly new to Zookeeper and
> Kafka; I've already spent a ton of time trying to understand what's
> going on here, but so far I've only scratched the surface of the issue
> without reaching a complete understanding of it.
>
> The issue:
> We have a Kafka cluster with a Zookeeper coordinating them all; the
> Zookeeper box log file (/var/log/zookeeper/zookeeper.out) is being
> filled with this error:
>
> 2018-07-24 08:23:43,178 [myid:3] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] -
> Exception causing close of session 0x3000017d522019e: Len error
> 1119502

Hi Martin.

ZK has a configured limit for message lengths, and it is being exceeded
for some reason on this session. Likely Kafka is sending too large a
dataset, or perhaps you have too many child znodes; unfortunately both
of those are Kafka-specific (Kafka acting as a ZK client) issues that I
don't have insight into. See the ZK admin guide for more on the config;
jute.maxbuffer is the setting here:

http://zookeeper.apache.org/doc/r3.4.13/zookeeperAdmin.html#Unsafe+Options

Notice you are exceeding that limit: "1119502".
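For illustration, here is a minimal sketch (in Python, not ZooKeeper's actual Java code) of the check behind that "Len error" message: the server reads a 4-byte big-endian length prefix from each request frame and closes the session if it exceeds jute.maxbuffer, whose default is 0xfffff (1048575 bytes, just under 1 MB):

```python
import struct

# ZooKeeper's default jute.maxbuffer: 0xfffff = 1048575 bytes (~1 MB).
JUTE_MAXBUFFER = 0xFFFFF

def check_request_length(frame, limit=JUTE_MAXBUFFER):
    """Read the 4-byte big-endian length prefix of a request frame and
    reject it if it exceeds the configured limit (the 'Len error' case)."""
    (length,) = struct.unpack(">i", frame[:4])
    if length < 0 or length > limit:
        raise IOError("Len error %d" % length)
    return length
```

In your log, 1119502 > 1048575, so the frame is rejected and the session closed. Raising jute.maxbuffer (via -Djute.maxbuffer, on both servers and clients) works around it, but the admin guide lists it under "Unsafe Options" for a reason; it's better to find out what Kafka is stuffing into that request.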
> 2018-07-24 08:23:43,178 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] -
> Closed socket connection for client /10.20.3.16:40618 which had
> sessionid 0x3000017d522019e
> 2018-07-24 08:23:43,492 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.20.2.24:49410
> 2018-07-24 08:23:43,492 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@931] -
> Client attempting to renew session 0x3000017d52200b1 at
> /10.20.2.24:49410
> 2018-07-24 08:23:43,493 [myid:3] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@683] -
> Established session 0x3000017d52200b1 with negotiated timeout 30000
> for client /10.20.2.24:49410
>
> //
>
> The IP 10.20.2.24 refers to the box where we run Kafka Manager and
> Kafka Offset Monitor. Looking into the Kafka Offset Monitor logs I can
> see:
>
> 09:06:56 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:06:58 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:06:59 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:03 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:07 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:07 INFO ClientCnxn:852 - Socket connection established to
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181, initiating session
> 09:07:17 INFO ClientCnxn:975 - Opening socket connection to server
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181. Will not attempt to
> authenticate using SASL (unknown error)
> 09:07:17 INFO ClientCnxn:852 - Socket connection established to
> ip-10-20-1-6.ec2.internal/10.20.1.6:2181, initiating session
> 09:07:17 INFO ClientCnxn:1235 - Session establishment complete on
> server ip-10-20-1-6.ec2.internal/10.20.1.6:2181, sessionid =
> 0x20000178e5d0083, negotiated timeout = 30000

Are you using auth, e.g. Kerberos? Basically the client attempts to
connect using SASL; if the attempt fails it retries without it (and as
a result without auth).

> With 10.20.1.6 being the Zookeeper box. There's definitely something
> fishy going on here but I can't untangle what it is.
>
> I'd thankfully appreciate it if you guys could point me in the right
> direction to learn more about these errors.

You might check the Kafka docs to see if they talk about the
jute.maxbuffer issue. I'm assuming they see this on larger deployments
and must discuss it somewhere.

Regards,

Patrick

> Thanks!
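P.S. The SASL fallback behind those "Will not attempt to authenticate using SASL (unknown error)" lines can be sketched roughly like this (a Python sketch with hypothetical helper names, not the real Java ClientCnxn code): the client tries to obtain SASL credentials first, and if that fails it logs the warning and proceeds with an unauthenticated connection rather than aborting.

```python
def establish_session(connect, sasl_login):
    """Hypothetical sketch of the ZK client's SASL fallback:
    'connect' and 'sasl_login' stand in for the real client internals.
    A failed SASL setup downgrades to an unauthenticated session."""
    try:
        creds = sasl_login()  # e.g. Kerberos login in the real client
    except Exception as err:
        # Mirrors the log line you are seeing; the session still proceeds.
        print("Will not attempt to authenticate using SASL (%s)" % err)
        creds = None
    return connect(creds)  # session is established either way
```

So those lines are harmless noise if you don't intend to use SASL; they just mean no auth is in play on that connection.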