Follow-up: Re: kafka-mesos still refusing to launch brokers on one cluster

2016-05-17 Thread Justin Ryan
Hiya again!

Following up..

Some weeks ago folks on the list helped me troubleshoot a couple of issues:

  (a) kafka-mesos completely failing to register a framework at all on one of 
my clusters
  (b) tasks disappearing from mesos view, even though the processes are still 
running

I have at least somewhat overcome (a). However, although my kafka-mesos config 
lists all three zookeeper / mesos-master hosts, the scheduler only works when 
it runs locally on the active leader.  There are no firewall rules preventing 
communication between these hosts – zk and mesos would have a hard time 
holding elections at all if there were.

Regarding (b), I noticed this morning that my kafka brokers had disappeared 
from the mesos task list. Once I forced the host running the kafka-mesos 
scheduler to become the leader and restarted the scheduler, it found all of 
the existing running processes and now lists them as RUNNING, started 4 weeks 
ago.

Has anyone run into issues like this, or have any ideas what I might be 
missing? I'm approaching confidence to run in production since *technically* 
nothing is horribly wrong wrt the kafka brokers actually working, and I run a 
*lot* of them, so I'm not terribly worried about the scheduler being able to 
immediately relaunch a failed broker.  Mostly it just feels like 
man-behind-the-curtain distributed systems.  "Yes, yes, we have HA, but it 
usually doesn't work unless I fiddle with it manually." ;d


As always, thanks in advance!

Justin






Please consider the environment before printing this e-mail

The information in this electronic mail message is the sender's confidential 
business and may be legally privileged. It is intended solely for the 
addressee(s). Access to this internet electronic mail message by anyone else is 
unauthorized. If you are not the intended recipient, any disclosure, copying, 
distribution or any action taken or omitted to be taken in reliance on it is 
prohibited and may be unlawful. The sender believes that this E-mail and any 
attachments were free of any virus, worm, Trojan horse, and/or malicious code 
when sent. This message and its attachments could have been infected during 
transmission. By reading the message and opening any attachments, the recipient 
accepts full responsibility for taking protective and remedial action about 
viruses and other defects. The sender's employer is not liable for any loss or 
damage arising in any way.


Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-19 Thread Justin Ryan
Hiya Vinod, thanks again for chiming in!

Both frameworks are indeed binding to the same IP/interface, and the 
kafka-mesos scheduler has LIBPROCESS_IP set, per its docs.

strace tells me that it tries to connect and then pretty much goes quiet. This 
socket does show as ESTABLISHED in netstat.

—
I0419 13:11:49.881398 30511 sched.cpp:222] Version: 0.27.1
I0419 13:11:49.890899 30556 sched.cpp:326] New master detected at 
master@10.100.1.158:5050
[pid 30556] connect(43, {sa_family=AF_INET, sin_port=htons(5050), 
sin_addr=inet_addr("10.100.1.158")}, 16) = -1 EINPROGRESS (Operation now in 
progress)
I0419 13:11:49.892683 30556 sched.cpp:336] No credentials provided. Attempting 
to register without authentication
[pid 30531] +++ exited with 0 +++
Process 30578 attached
[pid 30552] +++ exited with 0 +++
--



Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-19 Thread Vinod Kone
On Tue, Apr 19, 2016 at 11:24 AM, Justin Ryan  wrote:

> Marathon has no trouble registering a framework and launching jobs on this
> cluster, only kafka-mesos. :/
>
>
Are both these frameworks binding to the same IP/interface? Does either of
them use any LIBPROCESS_IP or LIBPROCESS_PORT env variables?


Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-19 Thread Justin Ryan
So, I sniffed the conversation between my working kafka-mesos scheduler and 
the mesos master, and Wireshark pointed out that there were 0 server packets 
in this conversation.  So even in the working case the scheduler is simply 
POST-ing: it talks to the socket but receives no feedback on that connection, 
which explains why my non-working example looks fine on the wire but is not 
doing anything.

It still seems fairly clear that if there were firewall rules interfering, the 
POST would fail, and I confirm that I can telnet to the port.
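
One direction the telnet test does not cover is master-to-scheduler. With 
libprocess-based scheduler drivers such as the 0.27.1 one in this thread, 
replies are generally not sent back on the scheduler's own connection; the 
master opens a separate connection to the address the scheduler advertises, 
which would be consistent with a capture showing no server packets. The 
following is only a minimal sketch for checking that reverse path, assuming 
Python 3 is available on the master hosts; can_connect is a hypothetical 
helper name, not part of mesos or kafka-mesos.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves that the TCP handshake works,
    not that the peer speaks the expected protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run it on each master host against the scheduler's advertised address, for 
example can_connect("10.100.1.158", 45472) for the address seen in the 
original capture. Unless LIBPROCESS_PORT is set, that port is likely to 
change on every scheduler restart, so it has to be read from a fresh capture 
or from the scheduler's listening sockets each time.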

What other reasons might a framework registration fail? Marathon has no trouble 
registering a framework and launching jobs on this cluster, only kafka-mesos. :/



Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-18 Thread Justin Ryan
Nope, nothing about the scheduler trying to connect. :/

I agree the message isn’t making it there – this is why I fired up tcpdump, and 
attached the HTTP conversation to my original message, because it’s not a 
particularly sensible HTTP conversation from my POV, but the scheduler is 
talking to the socket for the mesos API.

I’m running the scheduler on the same host, and there are no relevant firewall 
rules, all tables (almost said chains ;d) are ACCEPT.

I suppose I could try tcpdump on a successful register on the working cluster 
and see what’s different. I’ll try this and report back.



Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-18 Thread Vinod Kone
Is there nothing in the mesos master logs about this scheduler trying to
connect/register? If not, the registration message from the scheduler is
not making it to the master. Are there any firewall rules between the
scheduler host and master host?


Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-18 Thread Justin Ryan
FWIW, Here’s the output of kafka-mesos.sh scheduler:

—
[root@zk01 kafka-mesos]# ./kafka-mesos.sh scheduler
Loading config defaults from kafka-mesos.properties
2016-04-13 15:34:29,168 [main] INFO  ly.stealth.mesos.kafka.Scheduler$  - 
Starting Scheduler$:
debug: true, storage: zk:/mesos-kafka
mesos: 
master=zk01.something.com:5050,zk02.something.com:5050,zk03.something.com:5050, 
user=marathon, principal=, secret=
framework: name=kafka, role=*, timeout=30d
api: http://zk01.something.com:7000, bind-address: , zk: 
zk01.something.com:2181,zk02.something.com:2181,zk03.something.com:2181/kafka, 
jre:
2016-04-13 15:34:29,303 [main] INFO  org.eclipse.jetty.server.Server  - 
jetty-9.0.z-SNAPSHOTWrappedArray()
2016-04-13 15:34:29,336 [main] INFO  
org.eclipse.jetty.server.handler.ContextHandler  - Started 
WrappedArray(o.e.j.s.ServletContextHandler@3aefe5e5{/,null,AVAILA
2016-04-13 15:34:29,351 [main] INFO  org.eclipse.jetty.server.ServerConnector  
- Started WrappedArray(ServerConnector@71b1176b{HTTP/1.1}{0.0.0.0:7000})
2016-04-13 15:34:29,352 [main] INFO  ly.stealth.mesos.kafka.HttpServer$  - 
started on port 7000
I0413 15:34:29.420655 12601 sched.cpp:222] Version: 0.27.1
I0413 15:34:29.423641 12647 sched.cpp:326] New master detected at 
master@10.100.1.158:5050
I0413 15:34:29.424010 12647 sched.cpp:336] No credentials provided. Attempting 
to register without authentication
—

In a working setup, the next line logged is the UUID of the registered 
framework, which is why I tcpdumped the traffic to the mesos master in my 
previous message.
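
The native driver lines in the output above (the ones from sched.cpp) follow 
the glog layout: a severity letter, month and day, time, thread id, source 
file and line, then the message. The following is a rough sketch for pulling 
the detected master out of such a log, which may help when comparing a 
working run with a failing one. It assumes lines wrapped by the mail client 
have been rejoined first; parse_glog and detected_masters are hypothetical 
names.

```python
import re

# glog prefix: <severity><MMDD> <HH:MM:SS.uuuuuu> <thread> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\]]+)\] (?P<message>.*)$"
)


def parse_glog(line):
    """Return the fields of one glog line as a dict, or None if it does not match."""
    match = GLOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def detected_masters(lines):
    """Collect the address from every 'New master detected at ...' message."""
    prefix = "New master detected at "
    found = []
    for line in lines:
        record = parse_glog(line)
        if record and record["message"].startswith(prefix):
            found.append(record["message"][len(prefix):].strip())
    return found
```

Lines from the JVM side of the scheduler (the ones starting with a full 
date and [main]) do not match this pattern and are skipped.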

From: Justin Ryan <jur...@ziprealty.com>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Monday, April 18, 2016 at 1:35 PM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Re: kafka-mesos still refusing to launch brokers on one cluster

They don’t say anything.  Mesos never attempts to start the jobs, as far as I 
can tell.  The framework fails to register, for reasons I can’t determine; 
nothing is ever logged that matches a search for ‘kafka’, and the brokers in 
kafka-mesos broker status show as ‘launching’ or somesuch, forever.

It is quite perplexing.

From: Vinod Kone <vinodk...@apache.org>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Monday, April 18, 2016 at 12:43 PM
To: user <user@mesos.apache.org>
Subject: Re: kafka-mesos still refusing to launch brokers on one cluster


On Mon, Apr 18, 2016 at 12:32 PM, Justin Ryan <jur...@ziprealty.com> wrote:
So, test is working again, but prod is still and has consistently been dead

What do the master/agent/scheduler logs say regarding kafka tasks?




Re: kafka-mesos still refusing to launch brokers on one cluster

2016-04-18 Thread Vinod Kone
On Mon, Apr 18, 2016 at 12:32 PM, Justin Ryan  wrote:

> So, test is working again, but prod is still and has consistently been dead


What do the master/agent/scheduler logs say regarding kafka tasks?


kafka-mesos still refusing to launch brokers on one cluster

2016-04-18 Thread Justin Ryan
Hiya all!

In recent weeks, folks on the list have given me some pointers on a couple 
issues I had, one where tasks were disappearing (hasn’t happened since I fired 
up a script to log status, of course ;d) and kafka-mesos refusing to launch 
brokers.

Quick overview, I have two clusters: a test cluster and a fresh, prod cluster.  
Both are built using the same chef code, with the only differences in 
configuration being the zookeeper / mesos master hostnames.  An unfortunate 
difference, which would be difficult to reverse at this point, is that the test 
cluster runs Ubuntu, and prod runs CentOS.  Our IT dept handed me VMWare keys 
to start the project and I did not know they’d be mandating CentOS for prod, 
but it also seems unlikely this is the cause of my troubles.

After weeks of everything working fine in test, everything recently started 
failing there.  I’m now fairly confident this was because my password change 
hadn’t gone through on one host: the fabric tasks I use to clear zookeeper 
when everything gets funky were failing on that host, so zookeeper never 
actually got cleared.  So, test is working again, but prod is still dead, and 
has been consistently.

I have sniffed the traffic between kafka-mesos scheduler and the mesos master – 
currently running on the same host – and am not really sure what to make of it:

—
POST /master/mesos.scheduler.Call HTTP/1.1
User-Agent: 
libprocess/scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Libprocess-From: 
scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Connection: Keep-Alive
Host:
Transfer-Encoding: chunked

3b
...7
5
.juryan..kafka!..CA(.2.*:.zk01.something.com
0

POST /master/mesos.scheduler.Call HTTP/1.1
User-Agent: 
libprocess/scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Libprocess-From: 
scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Connection: Keep-Alive
Host:
Transfer-Encoding: chunked

3b
...7
5
.juryan..kafka!..CA(.2.*:.zk01.aur.ziprealty.com
0

POST /master/mesos.scheduler.Call HTTP/1.1
User-Agent: 
libprocess/scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Libprocess-From: 
scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Connection: Keep-Alive
Host:
Transfer-Encoding: chunked

3b
...7
5
.juryan..kafka!..CA(.2.*:.zk01.something.com
0

POST /master/mesos.scheduler.Call HTTP/1.1
User-Agent: 
libprocess/scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Libprocess-From: 
scheduler-d02d88a5-b2cc-4712-b0c9-1c543a14d26d@10.100.1.158:45472
Connection: Keep-Alive
Host:
Transfer-Encoding: chunked

3b
...7
5
.juryan..kafka!..CA(.2.*:.zk01.something.com
0
—
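
Each request in the capture names its sender in the Libprocess-From header 
as id@host:port. As far as I understand libprocess, that is the address the 
master is expected to reply to, so it is worth confirming that it is the 
interface and port you intended (it should line up with LIBPROCESS_IP when 
that is set). The following is a small sketch for splitting such a value 
apart; parse_libprocess_pid is a hypothetical helper, not a mesos API.

```python
def parse_libprocess_pid(value):
    """Split a libprocess-style 'id@host:port' string into (id, host, port).

    Raises ValueError when the value does not have that shape.
    """
    ident, at, address = value.strip().rpartition("@")
    if not at or not ident:
        raise ValueError("expected id@host:port, got %r" % value)
    host, colon, port = address.rpartition(":")
    if not colon or not host or not port.isdigit():
        raise ValueError("expected id@host:port, got %r" % value)
    return ident, host, int(port)
```

Applied to the header value above, this gives host 10.100.1.158 and port 
45472; the same function also handles the master@10.100.1.158:5050 form that 
appears in the scheduler log.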

I’m still a bit unclear on where to go from here.  My work_dir is no longer in 
/tmp, as suggested, but the brokers never start and there is no /kafka in zk.

Part of me is inclined to rebuild the test cluster with CentOS for consistency, 
but the Ubuntu cluster is the only one I have working. It’s possible that some 
inadvertent difference in the defaults between the two is at play, which might 
be worth understanding, but I also see how I’ve gotten myself into an odd 
pickle. ;)

Anyway, as always, any input would be appreciated.  A successful test phase 
isn’t much good if we can’t actually launch!

Cheers,

Justin Alan Ryan
Sr. Systems Engineer
ZipRealty / Realogy

