----- mrogers at UU62+3E1vKT1k+7fR0Gx7ZN2IB0 ----- 2007.05.03 - 14:26:53GMT -----
I haven't looked at it in detail, but it looks like NodeStats.java is still keeping a separate liability estimator for each search type, correct? Sorry, but I really don't agree with the rationale behind that. It might seem to make sense to accept (for example) an SSK request but not a CHK insert when we're really tight on bandwidth, but the knock-on effects of that (especially with a separate throttle for each type of search) are undesirable. It would make more sense to use a single bandwidth liability estimator for all searches (CHK or SSK, insert or request). Yes, it will overestimate the bandwidth liability of SSK requests and underestimate it for CHK inserts, but *on average* it will still be correct, and it won't have any weird knock-on effects.

----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.05.04 - 12:48:20GMT -----

On average a lot of requests will time out if we implement it this way.

----- mrogers at UU62+3E1vKT1k+7fR0Gx7ZN2IB0 ----- 2007.05.06 - 15:19:43GMT -----

Sorry, can you explain why? On average we won't go over quota, so why will it cause timeouts? The bandwidth limiter can handle slightly bursty traffic.

----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.05.10 - 12:06:00GMT -----

No, we are not supposed to go much over the limit for more than a few seconds, because doing so interferes with other usage of the connection.

----- mrogers at UU62+3E1vKT1k+7fR0Gx7ZN2IB0 ----- 2007.05.10 - 13:36:55GMT -----

That's easily dealt with. For example, track the variance as well as the average, and don't accept a search unless there's a 99% chance you have enough bandwidth to handle it. Or calculate the maximum bandwidth a search could ever use (32 KB * (numPeers - 1)) and don't accept any searches unless you have that much bandwidth. It's really not important how you do it - the important thing is that we shouldn't accept an SSK when we would have rejected a CHK, or accept a request when we would have rejected an insert.
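[Editor's note: a unified acceptance check along those lines might be sketched as follows. This is a rough illustration only - the class name, the 90-second window, and the 2.33-sigma cutoff for ~99% confidence are assumptions, not Freenet's actual code.]

```java
// Hypothetical sketch: one bandwidth liability estimator shared by all
// search types (CHK/SSK, request/insert), with a variance-based margin
// so bursts stay under quota with ~99% probability.
public class UnifiedLiability {
    // Welford-style running mean/variance of per-search bandwidth cost
    private double mean = 0.0, m2 = 0.0;
    private long n = 0;

    // Called when a search completes, with the bytes it actually used.
    public synchronized void report(double bytesUsed) {
        n++;
        double delta = bytesUsed - mean;
        mean += delta / n;
        m2 += delta * (bytesUsed - mean);
    }

    private double stddev() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }

    // Accept a new search only if the projected liability of all running
    // searches plus this one, at roughly the 99th percentile per search
    // (mean + 2.33 sigma), fits in the 90-second bandwidth window.
    public synchronized boolean shouldAccept(int runningSearches,
                                             double limitBytesPerSec) {
        double quota = limitBytesPerSec * 90; // 90 seconds at full power
        double perSearch = mean + 2.33 * stddev();
        return (runningSearches + 1) * perSearch <= quota;
    }
}
```

A single estimator like this rejects or accepts all search types under the same rule, which is the point of the proposal: an SSK is never accepted where a CHK would have been refused.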
Otherwise we'll get dismal performance for CHKs and dismal performance for inserts, respectively.

----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.05.10 - 15:27:08GMT -----

Like I said, please have a look at the code.

----- mrogers at UU62+3E1vKT1k+7fR0Gx7ZN2IB0 ----- 2007.05.11 - 14:11:42GMT -----

I've looked, and the situation is still the same: we reject CHKs while accepting SSKs, and reject inserts while accepting requests, because there are four different running averages for bandwidth liability. Here's my understanding of the current situation, please correct me if I'm wrong:

* We keep four separate bandwidth liability estimators, for incoming SSK requests, SSK inserts, CHK requests and CHK inserts respectively
* When a request or insert comes in, we use the ping time and bandwidth delay to decide whether to reject it
* If it passes those tests, we use one of the four bandwidth liability estimators to decide whether to reject it
* We occasionally accept a CHK request or insert even if there isn't enough bandwidth, to keep the bandwidth delay estimator up to date

While I was looking at the code I came across some more things I don't understand in NodeDispatcher.java - would you mind giving me some more information?

// can accept 1 CHK request every so often, but not with SSKs because they aren't throttled so won't sort out bwlimitDelayTime, which was the whole reason for accepting them when overloaded...

// SSKs don't fix bwlimitDelayTime so shouldn't be accepted when overloaded.

The first seems to imply that incoming requests as well as local requests pass through the throttle - is that correct? But SSKs bypass the throttle - does that apply to local requests as well as incoming requests? The second suggests that SSKs aren't considered when calculating bwlimitDelayTime - which messages are considered?
Thanks, Michael

On Wednesday 08 August 2007 15:59, Matthew Toseland wrote:
> Unfortunately this thread is rather rambling; it includes lots of discussion
> on token passing as well as the original premise.
>
> ----- Anonymous at o9_0DTuZniSf_+oDmRsonByWxsI ----- 2007.04.25 - 20:18:36GMT -----
>
> I made some measurements of how a freenet node behaves if the bandwidth
> limit is set low: 10KBps and down to 6KBps (specifically, the input
> bandwidth limit; the output bandwidth limit was set to at least 15KBps,
> but as expected the actually used output bandwidth is comparable to (just
> slightly above) the actually used input bandwidth). The node itself was
> running frost but no uploads/downloads, so the absolute majority of network
> traffic was forwarded CHK/SSK requests/inserts.
>
> The results are interesting enough: CHK traffic drops as low as 5% (in
> packets) of CHK+SSK, while at least 92% of SSK requests were not satisfied
> due to assorted failures (plus quite a few more certainly resulted in a
> NotFound response because the key is missing from the whole network, but I
> don't have that number). This makes a low traffic node highly inefficient
> and disproportionately slow; it also slows down its peers with all the
> extra reject traffic. Worse, input bandwidth sometimes goes over the set
> limit, suggesting that on a hardware 33600/56000 bps modem, and even ISDN,
> things will just get worse due to increased delays.
>
> Another thing to note: a low bandwidth node (LBN) almost exclusively
> rejects requests with the "input bandwidth liability" reason, and extremely
> rarely for other reasons.
>
> Speculating a bit, the same picture will likely be observed for peers of a
> fast node (1Mbps or more) when many peers have a typical home connection of
> 256Kbps or less.
>
> Not sure if simulations ever showed anything like this, but contributing to
> the network mostly SSK service (when the absolute majority of SSK requests
> fail!)
> is rather useless: an optimally working network is supposed to transfer at
> least one CHK block for each SSK key, and typically much, much more (a
> single 10MB file consists of 481 CHK blocks!), and even if you found the
> SSK but not the CHK the SSK points to, then you failed to find the
> information you requested.
>
> OK, to make the long story short[er], at the end of this message you will
> find a small patch that noticeably improves the LBN situation. The idea is
> to reserve some bandwidth for CHK transfers (and SSK inserts, as those are
> too rare to penalize, and more valuable than requests). The line directly
> before the inserted one implicitly penalizes CHK transfers (as the much
> smaller SSK requests tend to re-reserve bandwidth the moment it gets
> released after a CHK transfer finishes, while the much larger CHK requests
> do not have such a good chance), so bandwidth should be reserved for 2 CHKs
> at least (and tests show that's enough to make a difference).
>
> Another thing I tried was increasing the 90 seconds period up to 120. That
> had some (no numbers here; just "noticeable but small") positive effect on
> making traffic smoother and staying closer to the set limit, without
> jumping up and down too much. Where did the 90 seconds number come from
> anyway, and how dangerous could 120 be?
>
> Some pros observed and/or thought out during tests of the patch:
> - I observe an increase of output payload by approx. 15% (of total
> traffic), making the LBN more useful for its peers.
> - the change is negligibly small for faster nodes so should not break
> anything globally.
> - entire network SSK flood traffic will be toned down a little bit (at
> temporarily overloaded nodes only), additionally simplifying life for LBNs:
> after all, requesting the same SSK every 15 seconds for 35 hours, 8100
> times in total (actual numbers from one of the tests before this patch was
> applied; there are over 60 other SSKs that were requested more than 1000
> times during the same period) is just way too much; SSKs are not inserted
> into the network THAT fast. [Is it worth remembering recently seen SSK
> requests, and not forwarding them if the same request was already forwarded
> within the last 10 minutes and resulted in DNF/RNF? A table of recently
> requested SSKs that are closest to the node location should not be too
> big?]
>
> And cons:
> - in exceptional conditions (specifically, with fewer than 2 incoming CHK
> requests per 90 seconds, regardless of node bandwidth; in fact I observe
> 2-7 CHK requests per second, that's 180-630 per 90 seconds), up to 800 Bps
> might end up being unused. For a high bandwidth node that's just way too
> small to notice; for an LBN that's still acceptable (10% of 56Kbps) and
> will decrease roundtrip delays a bit, which is always a good thing for such
> slow links.
>
> Other notes:
> - the distribution of location closeness/number of SSK requests is very
> nice: only SSK requests with a location very close to the node location get
> repeated frequently; the farther the SSK location is, the fewer requests
> the node sees, with those SSKs seen only once or twice per 1-2 day period
> distributed evenly across the location space. This suggests that routing is
> working fine.
> - As far as I understand, if the input bandwidth limit/liability is
> exceeded (but a packet was already received anyway), the CHK/SSK request
> gets instantly rejected (thus throwing away received bytes while input
> bandwidth has no spare volume!); only otherwise does the node check whether
> the requested key exists in the storage. Heh?
> This feels like a serious bug hurting overall network performance: better
> to query the storage and hopefully send back the result (or still reject if
> the key is not found locally) rather than wait for a retry request to waste
> more input bandwidth. At least for SSKs a reject and a reply are comparable
> in output bandwidth usage, so it's worth a little delay in the response. Or
> am I missing something?
>
> ===
> diff --git a/freenet/src/freenet/node/NodeStats.java b/freenet/src/freenet/node/NodeStats.java
> index 3b091b4..fb9f8b9 100644
> --- a/freenet/src/freenet/node/NodeStats.java
> +++ b/freenet/src/freenet/node/NodeStats.java
> @@ -414,6 +414,7 @@ public class NodeStats implements Persistable {
>  			successfulChkInsertBytesReceivedAverage.currentValue() * node.getNumCHKInserts() +
>  			successfulSskInsertBytesReceivedAverage.currentValue() * node.getNumSSKInserts();
>  		bandwidthLiabilityInput += getSuccessfulBytes(isSSK, isInsert, true).currentValue();
> +		if (isSSK && !isInsert) bandwidthLiabilityInput += successfulChkFetchBytesReceivedAverage.currentValue() + successfulChkInsertBytesReceivedAverage.currentValue(); // slightly penalize SSK requests by reserving bandwidth for 2 additional CHK transfers (or SSK inserts if any)
>  		double bandwidthAvailableInput =
>  			node.getInputBandwidthLimit() * 90; // 90 seconds at full power
>  		if(bandwidthLiabilityInput > bandwidthAvailableInput) {
> ===
>
> ----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.04.26 - 16:56:59GMT -----
>
> Most SSK requests fail. They DNF. The reason for this is that most SSK
> requests are polling for data that has not yet been inserted.
>
> Bandwidth liability is usually the main reason for rejection. If we reach
> most of the other reasons, there is a problem - usually a cyclical problem.
> The main reason for it is to ensure that we don't accept so many requests
> that some of them needlessly time out even though they succeeded.
> The timeout is 120 seconds, so we need the actual transfer to take less
> than this; on a request, 30 seconds seems a reasonable upper bound for the
> search time. We don't throw out many bytes when we reject a request/insert,
> because the bulk of it hasn't been sent yet - except with SSKs, where
> typically a little under half of the total bytes will have been moved.
> Ideally we wouldn't send requests until we have a good idea that they will
> be accepted, but token passing load balancing is a long way off, not likely
> to happen for 0.7.0.
>
> We cannot control input bandwidth usage precisely.
>
> Any more info on SSK flooding? Is it simply Frost?
>
> We can add a failure table - we had one before - however a failure table
> which results in actually blocking keys can be extremely dangerous; what I
> had envisaged was "per node failure tables", i.e. reroute requests which
> have recently failed to a different node, since we know it isn't where it's
> supposed to be.
>
> On what do you base the assertion about key closeness? It would be nice to
> have a histogram or circle on the stats pages showing recent keys on the
> keyspace - can you write a patch?
>
> As far as your patch goes, surely rejecting more SSK requests would be
> counterproductive, as it wastes bandwidth? Shouldn't a slow node accept
> those requests it's likely to be able to handle?
>
> I can see an argument that we shouldn't prefer SSKs, and that on slow nodes
> we do prefer SSKs... I'm not sure the above is the right way to deal with
> it though. The effect of the patch would be to never accept any SSKs unless
> we have plenty of spare bandwidth, correct?
>
> ----- Anonymous at o9_0DTuZniSf_+oDmRsonByWxsI ----- 2007.04.26 - 18:41:32GMT -----
>
> > Ideally we wouldn't send requests until we have a good idea that they
> > will be accepted, but token passing load balancing is a long way off,
> > not likely to happen for 0.7.0.
>
> Well, even the current algorithm implementation has certain room for
> improvement. Here are the typical numbers I observe:
>
> ===
> unclaimedFIFO Message Counts
> * FNPRejectOverload: 89 (45.2%)
> * FNPInsertTransfersCompleted: 59 (29.9%)
> * FNPDataNotFound: 15 (7.6%)
> * packetTransmit: 12 (6.1%)
> * FNPRejectLoop: 7 (3.6%)
> * FNPAccepted: 6 (3.0%)
> * FNPSwapRejected: 4 (2.0%)
> * FNPDataInsertRejected: 4 (2.0%)
> * FNPRouteNotFound: 1 (0.5%)
> * Unclaimed Messages Considered: 197
> ===
>
> FNPRejectOverload always stays at the top, sometimes with hundreds of
> messages (for the last hour before unclaimed messages expire, that's a
> lot), and so indicates that there is some bug (or bugs) with obeying the
> bandwidth limit.
>
> > Any more info on SSK flooding? Is it simply Frost?
>
> Not local frost for sure; it generates just several simultaneous SSK
> requests (by default max 8: 6 for boards plus 2 for filesharing, AFAIR;
> practically 2-4 simultaneous requests most of the time). The other 100 SSK
> requests (without the proposed patch) are forwarded ones.
>
> > We can add a failure table, we had one before, however a failure table
> > which results in actually blocking keys can be extremely dangerous;
>
> Is it, with a timeout of max a few minutes (i.e. at least a few times less
> than the SSK propagation time visible with frost messages)? Is it more
> dangerous than the current waste of bandwidth on requests for the same SSK
> key several times per minute? Have any simulations been done on that in the
> past?
>
> BTW, doesn't the observed very low store hit rate result from prioritising
> the likely-to-fail SSKs?
>
> BTW2, the failure table could also act as a targeted content propagation
> mechanism: if a node sees an SSK insert for a temporarily blacklisted
> (non-existing) SSK, then forwarding the insert (more likely a copy of the
> insert, for security reasons and for the sake of routing) to the original
> requestor should speed up propagation of new SSKs toward the nodes that
> already anticipate/await them.
>
> > what I had envisaged was "per node failure tables" i.e. reroute requests
> > which have recently failed to a different node since we know it isn't
> > where it's supposed to be.
>
> At a glance, a very nice idea. But LBNs typically answer with a reject, not
> a DNF... even with the current code. Probably such rerouting would even
> further increase SSK traffic toward LBNs, with a sharply increased volume
> of SSK rejects as the result. Hmm, some testing/simulation seems really
> needed here.
>
> > On what do you base the assertion about key closeness? It would be nice
> > to have a histogram or circle on the stats pages showing recent keys on
> > the keyspace - can you write a patch?
>
> Mmmm... in fact I just added custom logging, then a wild combination of
> grep/sed/sort/uniq to analyze the logs. But let me think, maybe visualizing
> a couple of the stats files I operate with will be rather trivial...
>
> But I would rather stay away from stats page graphics at this time, as the
> stats files I operate with (filtered+sorted+uniqued) are rather large,
> 20-50Mb each - too much memory for the toy. Unless your 'recent' means just
> 10-15 minutes at most?
>
> > As far as your patch goes, surely rejecting more SSK requests would be
> > counterproductive as it wastes bandwidth?
>
> Tests show the opposite: without the patch, payload output on the stats
> page never exceeded 38%; with the patch it reaches 53% or a little more
> within several minutes of node restart. So, with the patch, SSK/CHK
> forwarding behaviour 'feels' logical:
>
> without the patch:
> - just several CHKs, and over 100 SSKs, very typical.
>
> with the patch:
> - most of the time (say, 75%) the number of currently forwarded CHK
> requests+inserts approximately equals the number of SSK requests+inserts
> (i.e. 10-25 each, depending on the set bandwidth limit);
> - sometimes (say, 10%) CHK requests start to prevail, but current SSK
> requests+inserts seem never to go below the amount CHKs get at max without
> the patch (i.e. 6-10).
> This is very typical when the number of CHK inserts gets several times
> higher than CHK requests (a close fast peer inserting something really
> large?).
> - other times (say, 15%) the CHK requests+inserts flow does not saturate
> bandwidth, and the number of SSK requests quickly climbs to 50 or even over
> 100+ as it typically does without the patch.
>
> That's for an LBN. Raising the input bandwidth allotment, the number of
> SSKs quickly grows, resembling the situation without the patch.
>
> So that's why I suggest reserving bandwidth for 2 CHK transfers; 3 would
> kill SSKs, and 1 still lets SSKs seriously prevail over CHKs (but
> nonetheless gives a rather better ratio, so it is a legal value to try if
> the value of 2 alarms you too much). Except that, in the case of reserving
> bandwidth for 1 extra CHK, the proposed patch is not really needed: simply
> comment out the line starting with "bandwidthLiabilityInput +=" and
> decrease the 90 seconds constant to 80 (10 seconds is roughly how long a
> 33.6Kbps modem takes to transmit a single CHK - using anything noticeably
> slower than 28800/33600 baud for freenet will never work well anyway).
>
> > Shouldn't a slow node accept those requests it's likely to be able to
> > handle?
>
> Considering the very high chance of SSK request failures (at least 92%), I
> would say the answer is no. But with a sane SSK failure rate (say 75% or
> below) SSK requests would likely not waste the limited and thus precious
> LBN bandwidth so fruitlessly.
>
> The problem, in my belief, is the too-small size of UDP packets when SSK
> requests prevail: PPP(oE)/TCP/FNP overhead becomes too large, while LBNs,
> unlike faster-link nodes, almost never coalesce packets, obviously.
>
> ----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.04.27 - 17:19:24GMT -----
>
> The current algorithm is working, on most nodes, far better than it has in
> *ages*.
> I'm at 62% of a 700MB ISO, I started inserting it yesterday morning, and
> only a few of my peers are backed off - frequently none are backed off;
> right now it's 11 connected, 6 backed off, which is more backed off than
> I've seen for quite a while.
>
> Re failure tables: Yes, it is extremely dangerous. It can result in
> self-reinforcing key censorship, either as an attack or just occurring
> naturally. This happened on 0.5. And the hit ratio is only for CHKs, iirc.
>
> Even LBNs don't often send local RejectedOverloads on SSKs *once they have
> accepted them*. They may relay downstream ROs, but that is not fatal. And
> if they reject some requests, so what - it's a slow node, it's bound to
> reject some requests with the current load balancing system.
>
> 10-15 minutes would be interesting. We already show a circle and histogram
> of nearby nodes from swapping, and of our peers; you'd just have to add
> another one. It would be good to have visual proof that routing is working
> on the level of adhering to node specialisations. I didn't expect it to be
> working given the load: I'm surprised that it does, it's an interesting
> result.
>
> Packet size has nothing to do with it; ethernet has a 1472 byte maximum.
> Dial-up has 576 bytes max, but we ignore it and use fragmented packets
> (this sucks, obviously, as it greatly increases the chance of losing a
> packet and having to retransmit it).
>
> Please explain why the patch doesn't result in never accepting a single
> SSK?
>
> ----- Anonymous at o9_0DTuZniSf_+oDmRsonByWxsI ----- 2007.04.27 - 19:31:14GMT -----
>
> > Packet size has nothing to do with it, ethernet has a 1472 byte maximum.
> > Dial-up has 576 bytes max, but we ignore it, and use fragmented packets
> > (this sucks, obviously, as it greatly increases the chance of losing a
> > packet and having to retransmit it).
>
> I am talking about typical/average packet size, not MTU.
> LBNs, unlike faster nodes, rarely have a chance to coalesce reject
> responses (over a max of 100ms), and thus send disproportionately more tiny
> packets, resulting in much higher protocol overhead. Thus having LBNs
> mostly cater to SSKs, not CHKs, results in the lowest imaginable usefulness
> of LBNs for the network as a whole.
>
> BTW, in my experience the typical/default dialup/PPP MTU is 1500 minus link
> level headers, like ethernet. 576 is a reasonable adjustment for
> interactive traffic like ssh, but I fail to remember it being used as a
> default since the time the super fast 28800 baud modems became common. :)
> 1400+ is the typical size of GPRS PPP packets too, and the same holds true
> for other popular wireless media like Bluetooth or WiFi; so I have no
> concerns regarding IP fragmentation.
>
> > Please explain why the patch doesn't result in never accepting a single
> > SSK?
>
> I can not. :) Can you explain why the current code that penalizes CHKs
> still gives 5% for them, even though CHKs are 25 times larger and similarly
> less frequent, so have a really hard time arriving at the exact moment when
> bandwidth liability is not maxed out?
>
> Seriously, I believe that goes with 2 facts:
>
> - SSK requests are much more frequent, so any temporary drop in the CHK
> request level enables the node to quickly get a bunch of new SSKs accepted
> for processing;
> - the large CHK requests (at times when they prevail over SSKs) tend to hit
> other limits too, like the "output bandwidth liability" and "Insufficient
> input/output bandwidth" throttles. Then the small SSK requests quickly pick
> up all the remaining bits of bandwidth.
>
> But currently I do not have relevant statistics to prove that.
>
> Anyway, please commit the following patch - it should equal out bandwidth
> rights for CHKs and SSKs at least half way toward a fair/expected
> distribution (and the change will make any difference for high-/over-loaded
> nodes only).
> Once most of my peers (and their peers) update, I will study the new node
> traffic forwarding efficiency and behavior at different bandwidth limits
> and with different penalization levels again - and then will be in a better
> position to prove that the original proposal of reserving bandwidth for 2
> CHKs is optimal (or maybe withdraw it).
>
> ===
> diff --git a/freenet/src/freenet/node/NodeStats.java b/freenet/src/freenet/node/NodeStats.java
> index 3b091b4..98c82c3 100644
> --- a/freenet/src/freenet/node/NodeStats.java
> +++ b/freenet/src/freenet/node/NodeStats.java
> @@ -399,9 +399,8 @@ public class NodeStats implements Persistable {
>  			successfulSskFetchBytesSentAverage.currentValue() * node.getNumSSKRequests() +
>  			successfulChkInsertBytesSentAverage.currentValue() * node.getNumCHKInserts() +
>  			successfulSskInsertBytesSentAverage.currentValue() * node.getNumSSKInserts();
> -		bandwidthLiabilityOutput += getSuccessfulBytes(isSSK, isInsert, false).currentValue();
>  		double bandwidthAvailableOutput =
> -			node.getOutputBandwidthLimit() * 90; // 90 seconds at full power; we have to leave some time for the search as well
> +			node.getOutputBandwidthLimit() * 80; // 80 seconds at full power; we have to leave some time for the search as well
>  		bandwidthAvailableOutput *= NodeStats.FRACTION_OF_BANDWIDTH_USED_BY_REQUESTS;
>  		if(bandwidthLiabilityOutput > bandwidthAvailableOutput) {
>  			preemptiveRejectReasons.inc("Output bandwidth liability");
> @@ -413,9 +412,8 @@
>  			successfulSskFetchBytesReceivedAverage.currentValue() * node.getNumSSKRequests() +
>  			successfulChkInsertBytesReceivedAverage.currentValue() * node.getNumCHKInserts() +
>  			successfulSskInsertBytesReceivedAverage.currentValue() * node.getNumSSKInserts();
> -		bandwidthLiabilityInput += getSuccessfulBytes(isSSK, isInsert, true).currentValue();
>  		double bandwidthAvailableInput =
> -			node.getInputBandwidthLimit() * 90; // 90 seconds at full power
> +			node.getInputBandwidthLimit() * 80; // 80 seconds at full power
>  		if(bandwidthLiabilityInput > bandwidthAvailableInput) {
>  			preemptiveRejectReasons.inc("Input bandwidth liability");
>  			return "Input bandwidth liability";
> ===
>
> ----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.04.28 - 17:05:53GMT -----
>
> Why does assuming 80 seconds instead of 90 help? I would have expected it
> to make the situation worse.
>
> Isn't what you want to increment the value you are multiplying the CHK byte
> counters by, if the request is an SSK? In any case I'm not convinced - we
> accept 32x as many SSKs as CHKs precisely because they use 32x less
> bandwidth. As far as I can see, incrementing the CHK counts, but only on a
> CHK, would just result in us never accepting an SSK...
>
> But by all means continue to investigate.
>
> ----- mrogers at UU62+3E1vKT1k+7fR0Gx7ZN2IB0 ----- 2007.04.30 - 19:36:36GMT -----
>
> > we accept 32x as many SSKs as CHKs precisely because they use 32x less
> > bandwidth.
>
> Sorry, but I don't understand the rationale behind this. It seems to be
> based on the assumption that equal resources should be allocated to SSKs
> and CHKs, regardless of whether there's equal demand for resources. If
> we're only getting, say, 16 times as many SSK requests as CHK requests,
> would we reject CHK requests to keep things "fair"?
>
> ----- toad at zceUWxlSaHLmvEMnbr4RHnVfehA ----- 2007.05.02 - 16:13:52GMT -----
>
> Why should CHKs be prioritised over SSKs?
>
> What do you think of the patch I committed anyway?
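[Editor's note: the "remember recently failed SSK requests" idea debated in the thread - suppress re-forwarding a key that already DNFed/RNFed within the last few minutes - could be sketched as a small time-bounded LRU table. All class and method names below are hypothetical and simplified; this is not Freenet's actual failure table code.]

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded table of recently failed SSK requests.
// A request for a key that failed within the window is answered locally
// instead of being forwarded again, saving bandwidth on polling traffic.
public class RecentFailures {
    private final long windowMs;
    // Access-ordered LRU map of key -> time of last failure, capped so
    // the table stays small even under heavy polling.
    private final LinkedHashMap<String, Long> lastFailure;

    public RecentFailures(final int maxEntries, long windowMs) {
        this.windowMs = windowMs;
        this.lastFailure = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > maxEntries; // evict oldest when full
            }
        };
    }

    // Record that a request for this key failed (DNF/RNF) just now.
    public synchronized void recordFailure(String key, long nowMs) {
        lastFailure.put(key, nowMs);
    }

    // True if the same key failed within the window, so the request
    // should not be forwarded again yet.
    public synchronized boolean shouldSuppress(String key, long nowMs) {
        Long t = lastFailure.get(key);
        return t != null && nowMs - t < windowMs;
    }
}
```

Note that this illustrates only the mechanism; the thread's open question - whether any key-based suppression, even with a short timeout, risks the self-reinforcing key censorship toad describes - is a policy problem the sketch does not answer.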