[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #6 from C. Scott Ananian canan...@wikimedia.org ---
Well, it was 3-4 of the parsoid hosts which had this problem.  So it's possible
that these were hung processes which needed to wait for the timeout.

But in this case I would expect 'service parsoid stop' not to actually complete
until the timeout had finished and the process had actually stopped.  The
problem was that the service was restarted while the old parsoid was still
running (or so it would seem).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #7 from ssas...@wikimedia.org ---
Ah, if it was only 3-4 hosts, then maybe that is what happened. Ping gwicke
about the service restart part. He might have a clue what is going on.



[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #1 from C. Scott Ananian canan...@wikimedia.org ---
Filed bug 73396 for the "logs don't make it to logstash" issue.



[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #2 from C. Scott Ananian canan...@wikimedia.org ---
Node's docs claim that they already use SO_REUSEADDR:

http://nodejs.org/api/net.html#net_server_listen_port_host_backlog_callback

This makes our EADDRINUSE errors very mysterious.

However, the 'cluster' package does mysterious things:
http://nodejs.org/api/cluster.html#cluster_how_it_works

It's possible that's what's going wrong here, somehow?
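
A note on what SO_REUSEADDR does and does not cover, as a hedged sketch rather
than a diagnosis: on Linux the option lets a new socket bind a port whose old
connections are lingering in TIME_WAIT, but it does not let a second socket
listen on a port that a live listener still holds. So if the old parsoid were
still running, EADDRINUSE would be expected even with the option set. The name
tryDoubleListen is hypothetical, and the sketch runs in a single process
without 'cluster', which shares listening handles differently.

```javascript
var net = require('net');

// Hypothetical helper: start one server, then try to start a second one on
// the same address and port while the first is still listening.
// Calls callback(code) with the error code of the second listen attempt,
// or callback(null) if the second listen unexpectedly succeeds.
function tryDoubleListen(callback) {
    var first = net.createServer();
    // Port 0 asks the OS for any free port, so the sketch needs no fixed port.
    first.listen(0, '127.0.0.1', function() {
        var port = first.address().port;
        var second = net.createServer();
        second.on('error', function(err) {
            // Release the port before reporting, so the process can exit.
            first.close(function() {
                callback(err.code);
            });
        });
        second.listen(port, '127.0.0.1', function() {
            second.close(function() {
                first.close(function() {
                    callback(null);
                });
            });
        });
    });
}
```

Calling tryDoubleListen(function(code) { console.log(code); }) reports the
error code seen by the second listener.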



[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #3 from ssas...@wikimedia.org ---
We have had this code for a long time, and we have had a couple of deploys
since the node 0.10 upgrade. This is the first deploy after we enabled
timeouts, so I would say this is related to the timeouts (5 mins).



[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #4 from C. Scott Ananian canan...@wikimedia.org ---
Possibly we're not shutting down cleanly when the service is stopped, and the
zombie process continues holding on to the port until the timeout expires.



[Bug 73395] Parsoid should use SO_REUSEADDR when it binds to its port

2014-11-13 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=73395

--- Comment #5 from ssas...@wikimedia.org ---
So, this is the code that we use to cleanly terminate the workers (see
api/server.js):

cluster.disconnect(function() {
    logger.log( "info", "exiting" );
    process.exit(0);
});

This waits for the workers to shut down. However, in the common case, the title
should get processed fairly quickly and clear the timeout, so all the workers
should shut down in a timely fashion. I don't think all of these were cases of
hung processes that really do take 5 mins and eventually get killed by the
timeout (or at least, I would be surprised if they were).

So, one question I now have is if we are holding onto timeouts after request
processing and not clearing those out cleanly in all cases.
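
To make that question concrete: a pending setTimeout keeps a node process's
event loop alive, so a request timeout that is never cleared could keep a
worker from exiting for up to the full timeout. Below is a minimal sketch of
one way to track such timers and clear them when a request finishes.
TimeoutTracker and its methods are hypothetical names for illustration, not
Parsoid's actual code.

```javascript
// Hypothetical helper: remembers every outstanding request timeout so that
// none of them can outlive the request it was created for.
function TimeoutTracker() {
    this.pending = {};
    this.nextId = 0;
}

// Start a timeout and return an id that the caller uses to clear it later.
TimeoutTracker.prototype.start = function(ms, onTimeout) {
    var self = this;
    var id = this.nextId++;
    this.pending[id] = setTimeout(function() {
        // The timer fired, so it is no longer pending.
        delete self.pending[id];
        onTimeout();
    }, ms);
    return id;
};

// Clear a timeout. Safe to call more than once, or after the timer fired.
TimeoutTracker.prototype.clear = function(id) {
    if (Object.prototype.hasOwnProperty.call(this.pending, id)) {
        clearTimeout(this.pending[id]);
        delete this.pending[id];
    }
};

// Number of timeouts that are still holding the event loop open.
TimeoutTracker.prototype.count = function() {
    return Object.keys(this.pending).length;
};
```

A request handler would call start() when processing begins and clear() on
every completion path (success, error, and client disconnect). Logging count()
at shutdown would then show whether any timers were left behind.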
