[email protected] said:

> On 05/20/2011 12:30 PM, Pieter Hintjens wrote:
>
> > My own experience goes strongly against handling OOM in any way except
> > assertion. We explored this quite exhaustively in OpenAMQ and found
> > that returning errors in case of OOM was very fragile. It is not even
> > clear that an application can deal with such errors sanely, since many
> > system calls will themselves fail if memory is exhausted. We tried
> > hard to make this work, and in the end had to choose "assert" as
> > the only robust answer.
> >
> > It's particularly important for services because most of the time
> > there is a problem that must be raised and resolved, whether it's the
> > too-low default VM size, or the lack of HWMs on queues, or too-slow
> > subscribers, etc.
> >
> > The only exception to assertion, afaics, is for allocation requests
> > that are clearly unreasonable. And even then, assertion seems the
> > right response if these requests are internal. If they're driven by
> > user data (e.g. someone sending a 4 GB message to a service), the
> > correct response is detecting over-sized messages and discarding them
> > (and we have this code in 2.2 and 3.0).
> >
> > tl;dr - +1 for asserting on OOM, -1 for returning ENOMEM.
>
> +1 for asserts
>
> Still, some heuristics for handling OOM can be used. Say, "if you can't
> allocate an engine for a new connection, close the connection". Assert
> only if closing the connection fails.
I've no idea which is the better approach here; assertions are generally
the easier way out. Having said that, the system malloc(), for example,
generally does not assert if it cannot allocate memory.

I'd suggest that a good guideline would be:

1) If it is possible to cleanly return ENOMEM to the calling API, do so.
   This counts for user-initiated allocations.
2) If not, e.g. the allocation is internal and has no clear "caller",
   then an assertion is probably the best option unless there is a clear
   recovery path (e.g. drop the connection).

> The obvious question is whether it's good for anything. Even if we are
> able to recover from this allocation failure, the next one is likely to
> happen immediately afterwards. And I am not even mentioning that the
> process is most likely to be in the OOM killer's crosshairs at that
> point.

Minor point: that assumes the OOM condition is also a system-wide OOM
condition; at that point, all bets are off. However, the OOM condition
could also be caused by a resource limit set by the administrator.

-mato

_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
