Summary: Intermittent severe lag in mp-server games
                 Project: Battle for Wesnoth
            Submitted by: involution
            Submitted on: Fri 14 Oct 2016 10:18:36 PM UTC
                Category: Bug
                Severity: 3 - Normal
                Priority: 5 - Normal
              Item Group: Multiplayer
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 1.12
        Operating System: ubuntu



Sometimes, when playing wesnoth games on the mp-server, the game will lag for
an extended period of time during your turn when you make a move.

This means, for instance, when you order an attack or a recruit, the move part
of the action will be executed, but then your client will basically lock up (I
think `command_locker` object is active in terms of the source code) so that
you cannot save or issue new orders.

After about 30sec to a minute usually, the action will complete and the game
can continue.

Initially I thought that it must be that the server is blocking when trying to
generate randomness, since only the actions that seem to require randomness
appear to cause the lag. In wesnoth 1.12 randomness is always generated by the
server for the clients to prevent cheating.

However, I have no good reason to believe that's the issue, the server may
just be blocking on handling all messages as far as I know. Usually when this
happens, players chat "lag" to one another, but it's not clear how long it
takes those messages to be relayed.

Usually the games proceed normally after the lag passes.

I mentioned this on wesnoth-dev IRC on Oct 11 2016:

20161011 17:48:12< iceiceice> shadowm, i wanted to ask you a question about
the mp server
20161011 17:48:45< iceiceice> so, one of the annoying things about 1.12 server
recently (last few months) is that sometimes you will order an attack on a
unit, and the game will lag
20161011 17:48:56< iceiceice> like, for 30 seconds or a minute
20161011 17:49:28< iceiceice> then it will proceed normally
20161011 17:49:45< iceiceice> i dont know why this would happen, currently the
only theory i have is that the mp server runs out of entropy
20161011 17:50:08< iceiceice> and needs time to acquire in /dev/random
20161011 17:50:09< celticminstrel> Why is the server's entropy relevant?
20161011 17:50:18< iceiceice> because the server generates randomness used
20161011 17:50:21< iceiceice> for any attack or recruit
20161011 17:50:24< iceiceice> since 1.12
20161011 17:50:28< celticminstrel> Uhh. Really?
20161011 17:51:03< tad_carlucci> And access to the entropy source takes a few
micro seconds.
20161011 17:51:08< celticminstrel> But even if that's true, entropy shouldn't
be an issue.
20161011 17:51:12< celticminstrel> It doesn't need any entropy.
20161011 17:51:18< celticminstrel> It just needs a Mersenne twister.
20161011 17:51:31-!- Greg-Boggs [~greg_bogg@] has quit [Remote
host closed the connection]
20161011 17:51:34< iceiceice> celticminstrel i don't recall exactly how we set
it up
20161011 17:51:35< celticminstrel> There's not need to have true randomness
for a game.
20161011 17:51:38< celticminstrel> ^no
20161011 17:51:42< gfgtdf> iceiceice: i think the 1.12 server just uses
20161011 17:51:45< iceiceice> i agree but i dont recall if we coded that
20161011 17:51:59< iceiceice> idk if there are useful logs that can help debug
something like this?
20161011 17:52:04< iceiceice> its quite difficult to reproduce locally

20161011 17:54:13< gfgtdf> iceiceice: do you know whther other people goit
this issue too? And does it only happen when the server is full?
20161011 17:54:23< iceiceice> yes, you can ask any regular mp player
20161011 17:54:47< iceiceice> it *might* be related to when the server shuts
down, i'm not sure
20161011 17:54:57< gfgtdf> full like in 'a lot of player/games running'.
20161011 17:55:14< iceiceice> that's kind of hard to judge, there is not a
counter of how many games are running usually
20161011 17:55:40< iceiceice> i dont think it particularly has to do with the
server being busy
20161011 17:55:52< gfgtdf> iceiceice: can you do chat or receive chat while
waiting the 30 mins ?
20161011 17:56:10< iceiceice> its not 30 min, its like 30 sec to a min
20161011 17:56:19< iceiceice> i'm not totally sure if you can send / recieve

shadowm writes on Oct 11:

20161011 19:45:08< shadowm> iceiceice, gfgtdf: I gave loonycyborg access and
instructions (the other two would have to figure things out) for the purpose
of debugging wesnothd after the asio port.
20161011 19:45:24< shadowm> iceiceice, gfgtdf: And no-one has password access,
not even me.
20161011 19:46:24-!- ChipmunkV [~v...@static-89-94-113-91.axione.abo.bbox.fr]
has quit [Quit: ChipmunkV]
20161011 19:46:35< shadowm> iceiceice, gfgtdf: The log isn't public because it
contains IP addresses and ban logs.

20161011 21:18:36< gfgtdf> shadowm: do you have an idea what coudl habve
caused these lags in 1.12 ?
20161011 21:19:17< gfgtdf> shadowm: in teh last few months
20161011 21:25:13-!- iwaim [~iw...@rasteenie.alib.jp] has joined #wesnoth-dev
20161011 21:26:33< shadowm> Depends on the time of day, really.
20161011 21:27:27< shadowm> And day of the week.
20161011 21:29:14< gfgtdf> iceiceice: ^
20161011 21:29:32< shadowm> It's pretty much impossible for me to say anything
more useful without more information. I neither play MP or keep an eye on the
MP server since as you know I've been pretty much absent here since March.
20161011 21:31:05-!- Duthlet
[~duth...@dslb-188-104-253-155.188.104.pools.vodafone-ip.de] has quit [Quit:
20161011 21:32:53< shadowm> Plus, to give you an example, the log isn't as
verbose as you might expect: http://pastebin.com/3wd8uv6u
20161011 21:34:54< gfgtdf> shadowm: that looks like it doesn't contain more
information that the publig log at
20161011 21:35:15< shadowm> It contains more information, just not what you'd
be looking for.
20161011 21:35:31< shadowm> Otherwise I'd not have had to redact several bits
in there.
20161011 21:36:25< shadowm> Also, if it helps: 
20161011 21:36:27< shadowm>     03:27:49 shadowm@wesnothd:~$ uname -a
20161011 21:36:32< shadowm>     Linux wesnothd 2.6.32-19-pve #1 SMP Wed May 15
07:32:52 CEST 2013 x86_64 GNU/Linux
20161011 21:36:44< shadowm> *cue shock and horror* (Nope, can't do anything
about it.)
20161011 21:40:46< gfgtdf> shadowm, iceiceice: i wonder whther we could
somehow measure the time each pagage needs to be processed and log it if it
took to long or whether doing this woudl slow the server down too much.
20161011 21:42:05< shadowm> Not much I can do there other than rebuild/patch
and restart the server, etc. I've never been able to look at wesnothd's code
for too long.

Soliton also writes on Oct 12:

20161012 16:55:10< Soliton> iceiceice, gfgtdf: we're already logging some
performance stats. you can try '/query metrics' and '/query requests' to see
20161012 16:55:41< gfgtdf> Soliton: can i do that or do i ned more rights?
20161012 16:55:52< Soliton> for example currently one can see that one
leave_game request took 10 seconds to be processed.
20161012 16:56:07< Soliton> gfgtdf: anyone can see those stats.
20161012 16:57:51< Soliton> metrics currently reports: '12663393 requests
serviced. 10699877 (84%) requests were serviced immediately.'  in the logs i
can see that for a previous instance that percentage was down to 20%.
20161012 16:59:29< Soliton> of course figuring out what this means and if
there is an issue somewhere we can fix is another question...
20161012 16:59:47< gfgtdf> Soliton: thats still good to know thx
20161012 17:00:07< gfgtdf> Soliton: you know ehther the 1.13 boost::asio
server supports this aswell ?
20161012 17:00:19< Soliton> i guess first it'd be useful to know when these
lags happen. like time of day and in what situations.
20161012 17:01:10< Soliton> gfgtdf: looks like most of it is not implemented
there yet.

Rhonda writes on Oct 12, 2016:

20161012 07:11:58< Rhonda> Is the thing about stderr of wesnothd solved, or
can/should I help with that still?
20161012 07:12:36< shadowm> More information is needed first.

Opening this bug on the bug tracker so that we can properly track it.

I don't know how we can begin to solve it other than reading all the wesnothd
code to figure out why it might lag, or adding lots of logging to the
presently running wesnothd instance. There might be some secondary tool we can
use to monitor wesnothd for lag, which would be preferable.

Any ideas or more information welcome.


Reply to this item at:


  Message sent via/by Gna!

Wesnoth-bugs mailing list

Reply via email to