Re: [Standards] Presence distribution

Dave Cridland Mon, 06 Apr 2009 11:10:33 -0700

On Mon Apr  6 17:54:16 2009, Philipp Hancke wrote:

Actually, the current model enables a single client to bring down
a server by making the server broadcast large presence stanzas at
a high rate. Simplistic bandwidth control mechanisms such as "karma"
account traffic per socket, and do not throttle the client when it
generates excessive amounts of s2s traffic.

How does the client force the subscription?

Yes. But not at the cost of making the protocol worse. By the way, the
It is not "worse", just different.

No, it is worse: the failure case of the missing unsubscribed is a privacy leak.

Let's review what happens currently, assuming an initiating contact with n available remote contacts across d domains, each of which has y resources connected - y being different in each instance obviously:
"available remote contacts" is those bis optimizations?
Actually, were those discussed before they were introduced?
What is their impact on presence reliability?

Yes, and they're not mandated.

1) Client sends home server a broadcast presence. Cost E1, O(1).


Why not shift the responsibility for distribution to the client?
This would make E1=O(n) and E2=O(1) (simple message routing).
Afaics, your argument applies for that as well.

Ah, but the server has responsibility anyway, so it's a reasonable optimization there.

2) Home server, which (almost certainly) has client's roster in-memory, iterates through and emits one presence stanza per remote subscribed contact known to be available. Cost E2, O(n). (Iterating through a list of known available contacts).
Actually, why don't you send the presence stanza to each connected
resource of each available contact here (iterating through a list of
known available contact resources)?

It's not that server's responsibility to maintain that definitive list, though.

3) Home server encodes and transmits N stanzas, remote server decodes and transmits N stanzas. Cost E3, O(n). (One stanza per contact - arguably this is O(d) to open the stream, and O(y) to send the stanzas - you pick).
O(y) should be O(n) here?

No, it's d*O(y), but that's equivalent to O(y). I don't think it makes a huge difference whether it's O(y) + O(d) or O(n), though.

4) Remote server receives one or more fully-addressed stanzas, and broadcasts them to all resources. Cost E4, O(y). (Iterating through a list of resources of the recipient).
This is done n times. And you are neglecting costs for privacy list
checks (I will call that O(p), different for each instance and
quite costly imo).
Hence E4 = O(n) + n*(O(p) + O(y))

Well, the moment you start considering privacy lists, the entire thing breaks, because evaluation of those needs handling at both ends, so a reverse roster lookup fails. Instead, what's needed is list building, which means evaluating the privacy list locally n times, in order to collate a per-domain list of contacts. I've not considered this, because of the risks of such lists being lost, or out of sync themselves, and the problems of having arbitrary servers able to command resources on your server.

This gives E, as O(n) + O(y), a linear complexity algorithm.


O(n) + n*(O(p) + O(y)).

That's O(n + np + ny), I think.
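
For a concrete feel for that expression, here is a minimal sketch in Python that tallies unit operations for steps 2 to 4 of one broadcast, privacy list checks included. The names (`Contact`, `current_model_cost`) and the idea of charging one unit per emit, per privacy rule, and per delivery are illustrative assumptions, not taken from any server implementation.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    """One available remote contact (hypothetical model, not a real API)."""
    jid: str
    privacy_rules: int  # p: rules checked before delivery (assumed unit cost each)
    resources: int      # y: connected resources of this contact


def current_model_cost(contacts):
    """Count unit operations for one presence broadcast, steps 2-4.

    Each contact costs 1 (emit and transmit its stanza), plus p privacy
    list checks, plus y deliveries to its resources.
    """
    emit = len(contacts)                              # O(n)
    privacy = sum(c.privacy_rules for c in contacts)  # n * O(p)
    deliver = sum(c.resources for c in contacts)      # n * O(y)
    return emit + privacy + deliver


roster = [
    Contact("a@example.org", privacy_rules=3, resources=1),
    Contact("b@example.org", privacy_rules=0, resources=2),
    Contact("c@example.net", privacy_rules=5, resources=1),
]
print(current_model_cost(roster))  # 3 + 8 + 4 = 15
```

With p and y bounded by small constants the total stays linear in n, which is the point both sides of the thread agree on.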

You want to replace 2-4 with:
2') Home server collates roster by remote domain and emits one presence stanza per remote domain which has a subscribed contact known to be available. This is, I'll accept, likely to be close to the energy cost of 2, although due to the fluctuating nature of the final clause that collation has to be done each time, so it'll be a little above. Cost E2' is, therefore, > E2. O(n) + O(d) - I have a feeling this can be reduced to O(n).
3') encode/transmit/decode 1 stanza per remote domain. This is certainly an energy/cpu/cost saving compared to the 3 above, no argument from me here. Cost E3' is < E3. O(d) (One stanza per domain. Still linear, of course).
Linear with d. Assuming that people 'cluster', d << n is reasonable.

Sure, but I'm astonishingly unconvinced that the O(y) term in E3 is significant - I suspect the maintenance of the domain XML streams is by far the bigger cost, here - in fact, I personally think it's entirely reasonable to suggest that for all practical purposes, E3 == E3' == O(d), although in practice E3' will be slightly smaller, and in the case where y is particularly big, that difference will of course become significant.

But for the vast majority of cases, y == 1 - in my roster, it's a maximum of 12 (for jabber.org), the next highest is 6 or so (for gmail.com), and the remaining levels are 2 or 1, across quite a few domains. I don't think under these cases that - unless the constant involved is massive - O(y) is going to be a substantial factor.
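
To put rough numbers on that, a small Python sketch that counts the stanzas crossing s2s links for one broadcast under both schemes. The roster is made up for illustration (the two larger counts echo the figures above, the small domains are invented), and `wire_stanzas` is a hypothetical helper, not a real API.

```python
from collections import Counter


def wire_stanzas(contact_jids):
    """Return (per_contact, per_domain) s2s stanza counts for one broadcast.

    per_contact: one stanza per available remote contact (step 3).
    per_domain:  one stanza per remote domain that has at least one such
                 contact (step 3').
    """
    domains = Counter(jid.split("@", 1)[1] for jid in contact_jids)
    return sum(domains.values()), len(domains)


# Hypothetical roster: 12 contacts on one domain, 6 on another, then a tail
# of small domains (a.example ... d.example are invented for illustration).
roster = (
    [f"u{i}@jabber.org" for i in range(12)]
    + [f"u{i}@gmail.com" for i in range(6)]
    + ["x@a.example", "y@a.example", "x@b.example", "x@c.example", "x@d.example"]
)
per_contact, per_domain = wire_stanzas(roster)
print(per_contact, per_domain)  # 23 6
```

The sketch counts stanzas only. It says nothing about the cost of keeping the d streams open, which is the term argued above to dominate.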

4') Lookup sender against all rosters in the system, and determine which of the resulting potential recipients is online and available to the sender. It seems reasonable that it would be possible to limit the lookup to only contacts who are online and available - assuming we ignore privacy lists - but you're still adding a lookup and the
This depends on how you do the lookup imo.

Yes, Waqas argues that it might be replacement rather than addition.

associating lookup mechanics (like a hash table or something) which must be maintained in-memory. I strongly suspect that this is much, much more costly to use, build, and maintain than 4 above, hence E4' >> E4. I
If you are neglecting privacy lists above certainly.

I think you can reduce this to E4' = n*O(y).

n cannot possibly occur in this - n is a figure local to the initiating server. It might happen d times, though, on different servers.

believe this to be possible to implement as O(log(y)), but I also suspect it's overwhelmingly more likely that it's O(y) (as derived from a reverse roster lookup O(log(y)) combined with a resource-broadcast lookup O(y)).
This gives E' as O(n) + O(d) + O(y), a linear complexity algorithm. Poof goes your argument above about linear complexity versus constant amortized time.
I can't see (E3' - E3) < ((E4' - E4) + (E2' - E2)) as being at all likely, so I continue to maintain that this is a false optimization.
I disagree. But at least your argument is technically founded and better
than what I have heard from the council in 2006 (and I had to gather
that from the chatroom).

Waqas points out some potential flaws - I'm still not convinced that it represents a cost-saving for either end, but the difference might well be smaller than I thought. In particular, he suggests that E4 might be implemented as a single multivalued hashtable, thus reducing that lookup to O(1).
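
One way to read that suggestion, as a rough Python sketch only: the receiving server keeps a single multivalued hash table keyed by the remote sender, whose values are the local resources currently subscribed to that sender. Handling a per-domain presence stanza is then one average O(1) lookup followed by an O(y) delivery loop. The class and method names are hypothetical, privacy lists are ignored, and the cost of keeping the table in sync with rosters and sessions is not modelled.

```python
from collections import defaultdict


class ReverseRoster:
    """Multivalued hash table: remote sender JID -> local full JIDs."""

    def __init__(self):
        self._recipients = defaultdict(set)

    def resource_online(self, full_jid, subscribed_to):
        """Register a local resource under every sender it is subscribed to."""
        for sender in subscribed_to:
            self._recipients[sender].add(full_jid)

    def resource_offline(self, full_jid, subscribed_to):
        """Remove a local resource; drop keys that become empty."""
        for sender in subscribed_to:
            bucket = self._recipients.get(sender)
            if bucket is None:
                continue
            bucket.discard(full_jid)
            if not bucket:
                del self._recipients[sender]

    def recipients(self, sender):
        """Average O(1) hash lookup; sorted only to make output stable."""
        return sorted(self._recipients.get(sender, ()))


table = ReverseRoster()
table.resource_online("alice@b.example/laptop", ["dave@a.example"])
table.resource_online("alice@b.example/phone", ["dave@a.example"])
table.resource_online("bob@b.example/desk", ["dave@a.example", "phil@c.example"])
print(table.recipients("dave@a.example"))
table.resource_offline("alice@b.example/phone", ["dave@a.example"])
print(table.recipients("dave@a.example"))
```

The lookup itself is cheap in this form. The open question from the thread is the upkeep: every resource coming online or going offline touches one bucket per subscription.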

We had a very interesting conversation about the relative costs of string internalization and all sorts.

I care about effective flood control mechanisms. This is about moving
costs to where they are affordable.
(I also continue to maintain that this is an arse-clenchingly more complex solution to the problem of getting presence from one contact to another, but I hope nobody's arguing against me, there).
It is a different solution. One that has been used for years on IRC.

IRC has a different structure, and uniform view throughout the network - indeed, one server is specifically designed to be indistinguishable from any other. That's not the case in XMPP, and means different approaches and considerations. IRC behaves like a single XMPP cluster, in that respect, where all these optimizations are entirely possible.

p.s.: I still want a review for that 220-rewrite from you.


Pester me next week?

Dave.
--
Dave Cridland - mailto:[email protected] - xmpp:[email protected]
 - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
 - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
