Hi,

On Fri, Apr 30, 2010 at 6:28 PM, Bob Wyman <[email protected]> wrote:
> Dave Cridland <[email protected]> wrote:
> (Yes, we can reduce the
> client's to simply be display devices for state maintained on intermediary
> servers, but this is not, I think, ideal.)

I disagree. Caching works, as proved by HTTP. The problems with
caching in the HTTP world are mostly caused by the lack of a good
cache-update mechanism to make sure your caches are up-to-date.

Taking the fat Twitter notifications (with personal data, personal
prefs, and all that in the same notification) as an example: if you
model each of those data units as a separate PEP node, and allow
"last-mile" clients to use the pubsub caching server on their own
domain, that cache server will be notified of each change to the
information and will be able to update its own clients.
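To make the idea concrete, here's a minimal sketch of such a "last-mile" cache. Everything here is illustrative (the node names, the callback API, the in-memory dict); a real deployment would sit behind an XMPP pubsub library, but the shape is the same: the cache holds the latest payload per node, stays fresh because the publisher pushes every change, and serves local clients from memory.

```python
class PubsubCache:
    """Hypothetical per-domain cache: latest payload per PEP node."""

    def __init__(self):
        self._items = {}        # node -> latest payload
        self._subscribers = {}  # node -> set of local client callbacks

    def subscribe(self, node, callback):
        """Register a local client; serve from cache if we have the item."""
        self._subscribers.setdefault(node, set()).add(callback)
        if node in self._items:
            callback(node, self._items[node])  # no round-trip to the source

    def on_notification(self, node, payload):
        """Called when the remote publisher pushes a change to `node`."""
        self._items[node] = payload  # cache is updated on every change
        for cb in self._subscribers.get(node, ()):
            cb(node, payload)        # fan out to all local clients
```

The point is that the cache never needs an expiry heuristic: the publish/subscribe channel *is* the cache-update mechanism that HTTP lacks.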


> A reliance on clients' maintaining state would also seem to assume that a
> reasonably high percentage of the traffic shares message-independent
> "static" information with messages received earlier and thus that cache-hit
> rates are reasonably high.

And that's true for both the fat Twitter message and the example Dave
presented. A lot of the Atom metadata is not about the notification
itself but about the source of the notification. Split that into a
separate node.

> Client maintenance of state is most useful when
> all messages have the same originator. It is least useful when every message
> has a unique sender.

I subscribe to about 300 Atom/RSS feeds. That translates to 600 to 700
"notifications" per day. Right now, I'm using a classical pull system,
so the source metadata is shared across 10 to 30 notifications, but
most of those fetches are wasted because I already have the items.

If I switch to a push system, I would reduce the waste because I would
only get each notification once. On the other hand, each of those
notifications would send me all the source metadata, which I don't
need every time.

So even with a small number of sources, extracting the source
metadata would be useful.
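A back-of-envelope calculation shows why. The byte sizes below are assumptions I'm making for illustration (roughly 2 KB for an Atom source/author block, 1 KB for an entry), not measurements; the feed and notification counts are the ones from my own setup above.

```python
feeds = 300
notifications_per_day = 650    # midpoint of the 600-700 estimate
source_metadata_bytes = 2_000  # assumed size of the repeated source block
entry_bytes = 1_000            # assumed size of one entry payload

# Fat push: every notification repeats the full source metadata.
fat_push = notifications_per_day * (entry_bytes + source_metadata_bytes)

# Split push: entries only, plus at most one metadata transfer per feed
# per day (only when it actually changes; once per feed is the worst case).
split_push = notifications_per_day * entry_bytes + feeds * source_metadata_bytes

savings = 1 - split_push / fat_push
print(fat_push, split_push, f"{savings:.0%}")
```

Even under these modest assumptions, splitting the metadata out saves roughly a third of the daily traffic, and the saving grows as source metadata gets fatter relative to the entries.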


> However, in the future, I'm fairly confident that we'll see an
> increase in the number of systems that support "content-based" publish and
> subscribe. Thus, we'll see messages being delivered because of their
> content, not simply because of their author. This sort of thing will be very
> much like the "Track" function that originally influenced, in part,
> Twitter's adoption of "Atom over XMPP". In the "Track" use case, (when you
> might subscribe to all messages containing the keyword "XMPP") you'll often
> get messages from senders that you've never seen before or will never see
> again. Thus, you'll often find that cache hit rates are lower than you'd
> like even though you may dedicate a great deal of resource to maintaining
> that cache.

Let's model the problem a bit: a large number of users (potentially
several million) each receive a small set of notifications (let's say
2,000 per day per user), from a large set of sources (the same order
as the number of users in a balanced publisher/subscriber social
network, although I think the lurkers >> the publishers). The question
here is: do we send the source metadata with each notification?

Looking at the HTTP world, I can see a very similar pattern with
JavaScript frameworks like jQuery and Prototype. You have a large set
of users, each one browsing a small set of pages. Those pages share
their use of JavaScript frameworks.

Using CDNs like the Google AJAX API (I think that's the correct name)
or Yahoo!'s equivalent, all those sites share the same URL for the JS
framework, to make sure that all caches can reuse the same object
across all sites, giving better performance for everybody.

If you move the source metadata to a separate node, and don't include
it in each notification, local cache systems can provide better
performance even in those situations of content-based delivery,
because several clients across your local server will request
the same source metadata with the same key.
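Here's a small illustrative sketch of that. The names (`fetch_source_metadata`, the node addresses) are hypothetical stand-ins for a pubsub items request; the point is that when notifications carry only the source's node address, the per-domain cache pays the fetch cost once, no matter how many local "track"-style subscribers receive entries from a previously unseen source.

```python
fetch_count = 0

def fetch_source_metadata(node):
    """Stand-in for an items request against the source's metadata node."""
    global fetch_count
    fetch_count += 1
    return {"node": node, "title": f"metadata-for-{node}"}

cache = {}

def resolve(notification):
    """Attach source metadata to a slim notification, fetching at most once."""
    node = notification["source"]
    if node not in cache:          # the first local client pays the cost
        cache[node] = fetch_source_metadata(node)
    return cache[node]

# Two clients on the same server get entries from the same new source:
resolve({"source": "blog@example", "entry": "e1"})
resolve({"source": "blog@example", "entry": "e2"})
```

It's the same trick as the shared CDN URL: as long as every notification references the source metadata by the same key, the cache hit rate depends on how many *local* clients share a source, not on whether any one client has seen that source before.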


> So, we see that, at least, limitations in the XMPP protocol, resource
> limitations on the clients, and a move towards cache-inefficient
> content-based routing all tend to argue against an assumption that we can
> rely on clients to maintain state...

Client state is only limited by the local storage of the device, not
by the nature of the notifications, current or future (right now,
white-listed blogroll/followers lists; in the future, content-based
tracking like Collecta).

I argue that caching works in both situations if the information
inside the notification is properly arranged so that common units have
the same source address across the multiple notifications.

Bye,
-- 
Pedro Melo
http://www.simplicidade.org/
xmpp:[email protected]
mailto:[email protected]
