Michael Rogers wrote:
>> * I'm dropping messages at the tail when queues reach 50,000 messages
>> queued (for search and transfer queues). I implemented this in the hope
>> of getting rid of OOMs. I'm getting them anyway, so I've screwed
>> something up in the process.
>
> Not necessarily - 500 peers * 2 queues * 50,000 messages could easily
> eat a gig or two of memory.

My Sim.NODES is 100, not 500. In any case, the queues that keep growing are
the transfer queues, and with the ~17KB mean reply size you mention below,
that's clearly a lot more than 2GB - by a large margin (if the transfer
queues actually filled up with full replies, 100 * 50,000 * 17KB would be
on the order of 85GB).
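In case it helps to be concrete, the drop-at-tail limit I described is
roughly the following (just a simplified sketch, not the actual sim code -
the class and method names are made up):

    import java.util.LinkedList;

    // Rough sketch of the tail-drop limit: once a node's queue holds
    // 50,000 messages, new arrivals are simply refused.
    public class BoundedMessageQueue<M> {
        private static final int MAX_QUEUED = 50000;
        private final LinkedList<M> queue = new LinkedList<M>();
        private long dropped = 0;

        // New messages are dropped at the tail once the limit is hit.
        public synchronized boolean offer(M message) {
            if (queue.size() >= MAX_QUEUED) {
                dropped++;
                return false;
            }
            queue.addLast(message);
            return true;
        }

        // FIFO take; for the lifo runs this would be removeLast() instead.
        public synchronized M poll() {
            return queue.isEmpty() ? null : queue.removeFirst();
        }

        public synchronized long droppedSoFar() {
            return dropped;
        }
    }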
>> I could only simulate up to 30 with lifo queues and this change;
>> see the graph. I don't think it's correct. Do we have some idea of what
>> the theoretical maximum throughput is for the simulated network, as
>> currently defined?
>
> It depends on how far the data's travelling. Ignoring inserts and slow
> nodes for the moment, half the requests are for CHKs and half are for
> SSKs, so the average size of a reply is about 17 KB. The total capacity
> of the network is 1500 KB per second, and the maximum sustainable
> throughput (for FIFO at least) seems to be about 8000 replies in 2
> hours = 11 replies per second. That would imply that replies are
> travelling an average of 1500/(11*17) = 8 hops, but that's a *very*
> rough estimate.

I'm lost here - I don't see where the 8000 comes from. Maybe 40k * 2h =
80k? That would give 80k/(2*3600) = 11.11, which matches the 11 replies
per second. Your estimate seems plausible, and with your suggested changes
(which I snipped) we'll know for sure.

In any case, I'm more interested in finding some maximum number of
successes that even perfect routing couldn't surpass (the linear
progression of my last lifo run looks very suspicious). Please correct me:
we have 1500 KB/s of total data being transmitted. Assuming all of it went
to replies on the last hop of a successful retrieval (and that's being
optimistic! :), we'd have 1500k/17k = 88.24 successes per second. That is,
a maximum of 88 * 2 * 3600 < 634k successes per simulation run. OK, my
lifo runs don't reach that maximum. But assuming the 8-hop average, the
ceiling drops to only ~79k (634k/8), and then my last simulation is
clearly overboard. Even doubling that figure (since requests are very
small), I'm still over it. And I'm simulating with slow nodes.

>> * I'm counting just remote successes. If we are measuring the load
>> balancing performance, I don't think the local hits are of any interest,
>> and they could mask the remote ones.
>
> Good point - maybe that's why our figures for backoff at high loads are
> different. (I'm also at revision 11135 by the way.) It makes sense that
> you'd see a low success rate but high throughput if requests were either
> succeeding locally or not at all.

Yep. I missed this difference when writing before, so it's definitely a
factor that could change my results. I introduced this change only in this
last run, BTW.

> Unfortunately this suggests that simply counting the number of successes
> (or even remote successes) isn't an adequate measure of throughput -
> being able to retrieve the nearest tenth of the keyspace in one minute
> isn't equivalent to being able to retrieve the entire keyspace in ten
> minutes...

You have a good point there...

> Any suggestions for a better metric?

At this preliminary stage, I'd say that remote successes are a good start.
Later we could study the hop count of successes in a non-saturated network
and compare it with the saturated cases.

>> * I'm not computing failures anymore, since dropped messages far exceed
>> successes. Half an hour of simulation would produce nearly 1GB of logs,
>> given the number of messages dropped!
>
> We should probably replace the logging statements with a static counter.

I saw your mail about this, thanks. I won't have time to try it until
Friday, I reckon. I'll take advantage of that and redo my lifo changes
more carefully.
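Something along these lines is what I'd try for the counter (a rough
sketch only - the class name is made up, not anything that exists in the
sim):

    import java.util.concurrent.atomic.AtomicLong;

    // Static counter to replace the per-drop logging statements:
    // increment at each drop site, dump the total once at the end
    // of the run instead of writing a log line per dropped message.
    public final class DropCounter {
        private static final AtomicLong drops = new AtomicLong();

        private DropCounter() {}

        // Call this wherever a drop is currently logged.
        public static void increment() {
            drops.incrementAndGet();
        }

        // Read the total at the end of the simulation.
        public static long total() {
            return drops.get();
        }
    }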
If I could get svn write access sorted out, I could put them in a new
"phase" in the sim repository...

> Bear in mind that a dropped message doesn't necessarily lead to a
> failure - under some circumstances the upstream node can move on if it
> gets a timeout, so a search can suffer several dropped messages and
> still succeed.

This was an oversight on my part, right. Does this mean that the failure
will always be reported by the requesting/inserting node (when it times
out), so the drops shouldn't be counted in any case?

Kind regards,
Alex.