Michael Rogers wrote:
>> * I'm dropping messages at the tail when queues reach 50,000 messages
>> queued (for search and transfer queues). I implemented this in the hope
>> of getting rid of OOMs. I'm getting them anyway, so I've screwed
>> something up in the process.
>
> Not necessarily - 500 peers * 2 queues * 50,000 messages could easily
> eat a gig or two of memory.

My Sim.NODES is 100, not 500. In any case, the queues that keep growing are
the transfer queues, and with the ~17KB mean reply size you mention below,
that's clearly a lot more than 2GB - by a large margin (if the transfer
queues actually filled up with full replies, 100 * 50,000 * 17KB would be
on the order of 85GB).
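In case it helps to be concrete, the drop-at-tail limit I described is
roughly the following (just a simplified sketch, not the actual sim code -
the class and method names are made up):

    import java.util.LinkedList;

    // Rough sketch of the tail-drop limit: once a node's queue holds
    // 50,000 messages, new arrivals are simply refused.
    public class BoundedMessageQueue<M> {
        private static final int MAX_QUEUED = 50000;
        private final LinkedList<M> queue = new LinkedList<M>();
        private long dropped = 0;

        // New messages are dropped at the tail once the limit is hit.
        public synchronized boolean offer(M message) {
            if (queue.size() >= MAX_QUEUED) {
                dropped++;
                return false;
            }
            queue.addLast(message);
            return true;
        }

        // FIFO take; for the lifo runs this would be removeLast() instead.
        public synchronized M poll() {
            return queue.isEmpty() ? null : queue.removeFirst();
        }

        public synchronized long droppedSoFar() {
            return dropped;
        }
    }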
>> I could only simulate up to 30 with lifo queues and this change;
>> see the graph. I don't think it's correct. Do we have some idea of what
>> the theoretical maximum throughput is for the simulated network, as
>> currently defined?
>
> It depends on how far the data's travelling. Ignoring inserts and slow
> nodes for the moment, half the requests are for CHKs and half are for
> SSKs, so the average size of a reply is about 17 KB. The total capacity
> of the network is 1500 KB per second, and the maximum sustainable
> throughput (for FIFO at least) seems to be about 8000 replies in 2
> hours = 11 replies per second. That would imply that replies are
> travelling an average of 1500/(11*17) = 8 hops, but that's a *very*
> rough estimate.

I'm lost here - I don't see where the 8000 comes from. Maybe 40k * 2h =
80k? That would give 80k/(2*3600) = 11.11, which matches the 11 replies
per second. Your estimate seems plausible, and with your suggested changes
(which I snipped) we'll know for sure.

In any case, I'm more interested in finding some maximum number of
successes that even perfect routing couldn't surpass (the linear
progression of my last lifo run looks very suspicious). Please correct me:
we have 1500 KB/s of total data being transmitted. Assuming all of it went
to replies on the last hop of a successful retrieval (and that's being
optimistic! :), we'd have 1500k/17k = 88.24 successes per second. That is,
a maximum of 88 * 2 * 3600 < 634k successes per simulation run. OK, my
lifo runs don't reach that maximum. But assuming the 8-hop average, the
ceiling drops to only ~79k (634k/8), and then my last simulation is
clearly overboard. Even doubling that figure (since requests are very
small), I'm still over it. And I'm simulating with slow nodes.

>> * I'm counting just remote successes. If we are measuring the load
>> balancing performance, I don't think the local hits are of any interest,
>> and they could mask the remote ones.
>
> Good point - maybe that's why our figures for backoff at high loads are
> different. (I'm also at revision 11135 by the way.) It makes sense that
> you'd see a low success rate but high throughput if requests were either
> succeeding locally or not at all.

Yep. I missed this difference when writing before, so it's definitely a
factor that could change my results. I introduced this change only in this
last run, BTW.

> Unfortunately this suggests that simply counting the number of successes
> (or even remote successes) isn't an adequate measure of throughput -
> being able to retrieve the nearest tenth of the keyspace in one minute
> isn't equivalent to being able to retrieve the entire keyspace in ten
> minutes...

You have a good point there...

> Any suggestions for a better metric?

At this preliminary stage, I'd say that remote successes are a good start.
Later we could study the hop count of successes in a non-saturated network
and compare it with the saturated cases.

>> * I'm not computing failures anymore, since dropped messages far exceed
>> successes. Half an hour of simulation would produce nearly 1GB of logs,
>> given the number of messages dropped!
>
> We should probably replace the logging statements with a static counter.

I saw your mail about this, thanks. I won't have time to try it until
Friday, I reckon. I'll take advantage of that and redo my lifo changes
more carefully.
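Something along these lines is what I'd try for the counter (a rough
sketch only - the class name is made up, not anything that exists in the
sim):

    import java.util.concurrent.atomic.AtomicLong;

    // Static counter to replace the per-drop logging statements:
    // increment at each drop site, dump the total once at the end
    // of the run instead of writing a log line per dropped message.
    public final class DropCounter {
        private static final AtomicLong drops = new AtomicLong();

        private DropCounter() {}

        // Call this wherever a drop is currently logged.
        public static void increment() {
            drops.incrementAndGet();
        }

        // Read the total at the end of the simulation.
        public static long total() {
            return drops.get();
        }
    }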
If I could get svn write access sorted out, I could put them in a new
"phase" in the sim repository...

> Bear in mind that a dropped message doesn't necessarily lead to a
> failure - under some circumstances the upstream node can move on if it
> gets a timeout, so a search can suffer several dropped messages and
> still succeed.

This was an oversight on my part, right. Does this mean that the failure
will always be reported by the requesting/inserting node (when it times
out), so the drops shouldn't be counted in any case?

Kind regards,
Alex.