#25347: Tor stops building circuits, and doesn't start when it has enough directory information
-------------------------------------------------+------------------------------------
 Reporter:  teor                                 |          Owner:  asn
     Type:  defect                               |         Status:  needs_revision
 Priority:  Medium                               |      Milestone:  Tor: 0.3.3.x-final
Component:  Core Tor/Tor                         |        Version:  Tor: 0.3.0.6
 Severity:  Normal                               |     Resolution:
 Keywords:  031-backport, 032-backport,          |  Actual Points:
  033-must, tor-guard, tor-client,               |
  tbb-usability-website, tbb-needs,              |
  033-triage-20180320, 033-included-20180320     |
Parent ID:  #21969                               |         Points:  1
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+------------------------------------

Comment (by s7r):

 Replying to [comment:10 asn]:
 > Looking at your logs, it seems like your guard rejected about 230 new circuit creations in 15 minutes with the excuse of `RESOURCELIMIT`. And your client just kept making more and more circuits to the same guard that were getting rejected... I've also noticed this exact same behavior on a client of mine recently.
 >

 I see that as well, but that happens more often, and Tor has no problem switching to guard 2/3 or even guard 3/3 to maintain functionality. This time (it happens rarely) it remained completely stuck in this useless state.

 > My theory on why `RESOURCELIMIT` was used by your guard (given that you say that DoS patch was disabled) is that `assign_onionskin_to_cpuworker()` failed because `onion_pending_add()` failed because `have_room_for_onionskin()` failed. That means that the relay was overworked and had way too many cells to process at that time. Unfortunately, I can't see whether you are sending NTOR or TAP cells given your logs.
 >

 I know for sure the DoS patch is not related, because I triple-checked all 3 primary guards and not one of them was running a Tor version that includes the DoS patch we merged. I think I was using only NTOR cells, because I was only trying to reach check.tpo and duckduckgo clearnet websites.

 > Like you said, I think the most obvious misbehavior here is that you keep on hassling your guard even tho it's telling you to relax by sending you `RESOURCELIMIT` `DESTROY` cells. Perhaps one approach here would be to choose a different guard after a guard has sent us `RESOURCELIMIT` cells, in an attempt to unclog the guard and to get better service. '''Let's think about this some more:'''
 >
 > What's the best behavior here? Should we mark the guard as down after receiving a single `RESOURCELIMIT` cell, or should we hassle the guard a bit before giving up?
 >

 This is the most important part we need to take care of.
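
 To make the two options in that question concrete, here is a toy Python model. It is only a sketch and not Tor code: `GuardState`, `max_rejections`, `on_resourcelimit_destroy()`, `on_circuit_built()` and `attempts_until_down()` are all names made up for illustration, and the real guard logic has far more state than this. With `max_rejections=1` a single `RESOURCELIMIT` marks the guard down; a larger value hassles the guard a bit first.

```python
from dataclasses import dataclass


@dataclass
class GuardState:
    """Toy per-guard bookkeeping (hypothetical, not Tor's guard state)."""
    max_rejections: int              # hypothetical tunable threshold
    consecutive_rejections: int = 0
    marked_down: bool = False

    def on_resourcelimit_destroy(self) -> None:
        """Record one DESTROY cell with reason RESOURCELIMIT."""
        self.consecutive_rejections += 1
        if self.consecutive_rejections >= self.max_rejections:
            self.marked_down = True

    def on_circuit_built(self) -> None:
        """A successfully built circuit clears the rejection streak."""
        self.consecutive_rejections = 0
        self.marked_down = False


def attempts_until_down(max_rejections: int, rejections: int) -> int:
    """Count how many rejected attempts reach the guard before the client
    stops using it (all of them if the threshold is never reached)."""
    guard = GuardState(max_rejections)
    attempts = 0
    for _ in range(rejections):
        if guard.marked_down:
            break
        attempts += 1
        guard.on_resourcelimit_destroy()
    return attempts
```

 Fed the roughly 230 rejections from the logs, `attempts_until_down(1, 230)` stops after the first rejection and `attempts_until_down(5, 230)` after the fifth, while a threshold that is never reached lets all 230 attempts hit the same guard.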
 I dislike the idea of removing the guard after receiving a single `RESOURCELIMIT` cell. At the very least we should retry it after some time, using the exponential backoff exactly as we do when one of our primary guards is not running or not listed, and keep the same logic, timing and behavior so we don't have to maintain more branches.

 > Most importantly, can we make sure that the `DESTROY` cell came from the guard and not from some other node in the path? If we can make sure that the `DESTROY` cell came from the guard, this seems to me like a pretty safe countermeasure since we should trust the guard to tell us whether it's overworked or not.
 >

 As far as I understand from arma's comment, the `DESTROY` cell can only come from the guard.

 > WRT timeline here, I think working on this countermeasure (mark guard as down when overworked to get better service) seems like a plausible goal for 033, but anything more involved will probably need to wait for 034.
 >
 > Would appreciate feedback from Nick or Tim here :)
 >
 > ----
 >
 > I still can't explain why you managed to bootstrap after hacking your state file tho. Perhaps a coincidence? Perhaps you were overworking your guard and when you stopped, it relaxed? Perhaps the hack worked differently than you imagine? Not sure.

 I sincerely hope so. But it makes me think: the guard is overworked for many hours, yet when I delete my state file, restart, and edit the new state file to put back the same 3/3 primary guards that were not letting me connect, it connects just fine. I don't have any evidence that there was something wrong with the state file, and I don't see what could be wrong with it; it does not make any sense. This bug is very hard to reproduce or catch in the wild.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25347#comment:23>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online