Re: Outbound relayhost distribution
On Feb 23, 2011, at 8:25 PM, Victor Duchovni wrote: On Wed, Feb 23, 2011 at 02:19:28PM -0800, Robert Goodyear wrote: I'm sorry... I was speaking lazily there. I meant a 4.X.X response that would cause the message to requeue and follow a retry/backoff rate algorithm. Mere 4XX responses to MAIL FROM:, RCPT TO:, DATA or . don't impact active queue scheduling. Only responses that prematurely terminate the connection, or a 4XX banner or HELO response, trigger backoff (concurrency reduction and possible throttling of the destination after sufficiently many consecutive failures). What negotiation? What problem are you trying to solve? Trying to load-share my edge MTAs that are relayhosts from my origin Postfix in a more scientific way than just hitting them at random, because when one becomes saturated, its weight in the probability of receiving another request is not reduced programmatically. This is not necessary. The random loading is fine at low loads; under high loads Postfix connection caching is time-based rather than message-count based, which means that faster servers get a bigger share of the load. Some sites foolishly set limits on the number of deliveries per connection not just for suspect clients, but for all SMTP clients. They are doing themselves and everyone else a disservice. By negotiation, I mean the SMTP session from my origin to the relay wherein it might get a 4.X.X Not all 4XX responses are alike; tempfailing a message is normal, rejecting SMTP service is another matter. You still have not explained what real problem you're solving. Is this just premature optimization? -- if I can apply some logic that takes that one relay out of rotation for N minutes, that would be nice, because it would reduce chatter from subsequent retries and focus traffic on the other relays for a while. Have you seen problem relays in your upstream relay mix? What real symptoms do they exhibit and what is the observed impact on the upstream Postfix SMTP client? 
I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch), my deferred queues saturate too quickly. Again, rather than just brute-force it with more edge MTAs, I was hoping to devise a more deterministic way to control the internal relaying to my geographically-separated points of presence and shave off the few ms of conversation that are consumed in finding out if relay X will accept more messages yet.
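A starting point for that kind of log mining, sketched in shell. This is an assumption-laden sketch: it presumes the stock Postfix syslog line format (`... relay=host[ip]:port, ... status=deferred (...)`) and reads a maillog on stdin; adjust the patterns to whatever your own logging emits.

```shell
# Count deferred deliveries per relay in a Postfix maillog read on stdin.
count_deferred_by_relay() {
  awk '/ status=deferred / {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^relay=/) {
             split($i, a, /[=[]/)  # relay=host[ip]:port, -> a[2] = host
             sub(/,.*/, "", a[2])  # handles relay=none, (no bracketed ip)
             n[a[2]]++
           }
       }
       END { for (r in n) print n[r], r }' | sort -rn
}
```

Something like `count_deferred_by_relay < /var/log/maillog` then shows which relays accumulate the deferrals during a campaign spike.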
Re: Outbound relayhost distribution
On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote: Have you seen problem relays in your upstream relay mix? What real symptoms do they exhibit and what is the observed impact on the upstream Postfix SMTP client? I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch) that my deferred queues saturate too quickly. 20 million a month is a moderate mail flow if it is mail from ~50-100K users spread out over the day. I would then expect no more than ~1K messages in the deferred queue of each of ~4 machines to be about the right quantity of deferred email. 4 million messages to deliver all at once is a very different problem. -- Viktor.
Re: Outbound relayhost distribution
On 2/25/2011 4:38 PM, Robert Goodyear wrote: I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch) that my deferred queues saturate too quickly. Again, rather than just brute-force it with more edge MTAs, I was hoping to devise a more deterministic way to control the internal relaying to my geographically-separated points of presence and shave off the few ms of conversation that are consumed in finding out if relay X will accept more messages yet. Standard advice for this problem is to designate an internal fallback_relay (which can be a whole second MX farm) to handle mail that isn't delivered quickly. That way the primary outbound machines aren't bogged down with a clogged defer queue. I think this is discussed in TUNING_README. But you're wise to analyze prior data and determine exactly where the bottleneck is before wholesale restructuring. -- Noel Jones
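In sketch form, Noel's fallback suggestion looks roughly like this, assuming Postfix 2.3+ where the parameter is spelled smtp_fallback_relay (older releases call it fallback_relay); fallback.example.com is a hypothetical name for the internal second-tier farm:

```shell
# Route mail that can't be delivered promptly to an internal second-tier
# relay instead of letting it clog the primary machines' deferred queues.
# Brackets suppress MX lookups for the fallback name.
postconf -e 'smtp_fallback_relay = [fallback.example.com]'
postfix reload
```

The fallback machines then own the slow retries while the primaries keep draining fresh mail.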
Re: Outbound relayhost distribution
On Feb 25, 2011, at 2:58 PM, Victor Duchovni wrote: On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote: Have you seen problem relays in your upstream relay mix? What real symptoms do they exhibit and what is the observed impact on the upstream Postfix SMTP client? I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch) that my deferred queues saturate too quickly. 20 million a month is a moderate mail flow if it is mail from ~50-100K users spread out over the day. I would then expect no more than ~1K messages in the deferred queue of each of ~4 machines to be about the right quantity of deferred email. 4 million messages to deliver all at once is a very different problem. It is definitely a lumpy distribution -- probably 2 to 3 per month of ~4-5 million to North American subscribers, interspersed with smaller regional (outside North America) campaigns of 250-300K that sometimes coincide with one of the big campaigns. Of course I could start building stovepipes in my topology to isolate activity so one doesn't affect the other, but then conversely I might have cold MTAs sitting idle when I could be using them. I *do* have some regional points of presence where I have MTAs close to the subscribers for their markets, e.g.: UK, EU and SE Asia; maybe I should experiment with offloading deferred North America queues to them. I wonder if their inherent latency would act as a rate limiter of sorts that would play more nicely with recipient domains? Anyway I'm speculating... let me go crazy with SPSS and look for some absolute patterns in the last year here.
Re: Outbound relayhost distribution
the quantity of deferred is yahoo response : this as that that is this On Friday, February 25, 2011, at 15:29 -0800, Robert Goodyear wrote: On Feb 25, 2011, at 2:58 PM, Victor Duchovni wrote: On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote: Have you seen problem relays in your upstream relay mix? What real symptoms do they exhibit and what is the observed impact on the upstream Postfix SMTP client? I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch) that my deferred queues saturate too quickly. 20 million a month is a moderate mail flow if it is mail from ~50-100K users spread out over the day. I would then expect no more than ~1K messages in the deferred queue of each of ~4 machines to be about the right quantity of deferred email. 4 million messages to deliver all at once is a very different problem. It is definitely a lumpy distribution -- probably 2 to 3 per month of ~4-5 million to North American subscribers, interspersed with smaller regional (outside North America) campaigns of 250-300K that sometimes coincide with one of the big campaigns. Of course I could start building stovepipes in my topology to isolate activity so one doesn't affect the other, but then conversely I might have cold MTAs sitting idle when I could be using them. I *do* have some regional points of presence where I have MTAs close to the subscribers for their markets, e.g.: UK, EU and SE Asia; maybe I should experiment with offloading deferred North America queues to them. I wonder if their inherent latency would act as a rate limiter of sorts that would play more nicely with recipient domains? Anyway I'm speculating... 
let me go crazy with SPSS and look for some absolute patterns in the last year here. -- gpg --keyserver pgp.mit.edu --recv-key 092164A7 http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x092164A7
Re: Outbound relayhost distribution
On 2/22/2011 11:44 PM, Robert Goodyear wrote: On Feb 22, 2011, at 9:06 PM, Noel Jones wrote: On 2/22/2011 9:29 PM, Robert Goodyear wrote: The postfix connection caching algorithm will automatically limit the damage caused by a subset of slow-responding relayhosts. I suppose there's a tipping point of how much it would queue in memory versus paging it off to disk, right? Or would, for example, our friends at (name your favorite ISP here) decide to greylist us and just saturate the deferred queue such that the subset becomes the majority? In other words, is Postfix's algorithm written to prevent exclusive saturation by reserving some percentage of its allocated limits for !=(grouchyISP) That was referring to mail for a single destination, such as a relayhost farm. For general high-volume internet delivery, it's generally recommended to have an internal fallback_relay or two that collects mail that can't be delivered right away. This keeps the defer queue low and keeps slow/dead destinations from hogging the active queue. This is covered in the docs referenced earlier. I just had a thought, however... I wonder if I can mess with the backoff behavior of my edge MTAs to tell my origin server to cool it a bit in response to its (the edge's) workload? Will MX parity cause Postfix to hear the backoff request and move on to another equally-weighted server, or will it just defer the message and mark it as being destined for the exact same server that it handshook with and got the backoff request from? Postfix does not remember the last host tried for a destination, and looks up the MX each time the message is scheduled for delivery. Most of the grouchy ISPs have some sort of registration program for high-volume senders so you're not throttled or blacklisted. The default settings should give very good performance. 
For knobs to twist, please see: http://www.postfix.org/TUNING_README.html#mailing_tips http://www.postfix.org/QSHAPE_README.html http://www.postfix.org/QSHAPE_README.html#backlog -- Noel Jones Thanks for those links. It's easy to get bogged down in theory and not RTFM from the top again.
Re: Outbound relayhost distribution
On Feb 23, 2011, at 4:27 AM, Noel Jones wrote: On 2/22/2011 11:44 PM, Robert Goodyear wrote: On Feb 22, 2011, at 9:06 PM, Noel Jones wrote: On 2/22/2011 9:29 PM, Robert Goodyear wrote: The postfix connection caching algorithm will automatically limit the damage caused by a subset of slow-responding relayhosts. I suppose there's a tipping point of how much it would queue in memory versus paging it off to disk, right? Or would, for example, our friends at (name your favorite ISP here) decide to greylist us and just saturate the deferred queue such that the subset becomes the majority? In other words, is Postfix's algorithm written to prevent exclusive saturation by reserving some percentage of its allocated limits for !=(grouchyISP) That was referring to mail for a single destination, such as a relayhost farm. For general high-volume internet delivery, it's generally recommended to have an internal fallback_relay or two that collects mail that can't be delivered right away. This keeps the defer queue low and keeps slow/dead destinations from hogging the active queue. This is covered in the docs referenced earlier. Ah right... I'll add a couple more fallbacks. I'm only running one behind six edge MTAs so I should increase the ratio as I add more edge MTAs. I just had a thought, however... I wonder if I can mess with the backoff behavior of my edge MTAs to tell my origin server to cool it a bit in response to its (the edge's) workload? Will MX parity cause Postfix to hear the backoff request and move on to another equally-weighted server, or will it just defer the message and mark it as being destined for the exact same server that it handshook with and got the backoff request from? Postfix does not remember the last host tried for a destination, and looks up the MX each time the message is scheduled for delivery. Most of the grouchy ISPs have some sort of registration program for high-volume senders so you're not throttled or blacklisted. 
Oh trust me, I have one person on my team who spends most of his time maintaining unsubscribes, blacklists, feedback loops, whitelist requests, RBL pleadings (it's amazing how many RBLs spring up out there because someone thinks they've invented a better mousetrap. Unfortunately it only impacts legitimate senders and costs us time and money, whereas the spammers are unperturbed and just stay one step ahead). So: the message is ready to send. Postfix queries DNS for my smarthost entry and gets MTA1 = 10, MTA2 = 10. Postfix opens a connection to MTA1 which responds with a 'not now, too busy' response. Does Postfix hold the MX record in memory _for the duration of THIS message's delivery attempt_ knowing that MTA2 is next, or does Postfix re-query to look for another peer immediately? If it re-queries, and knowing that SMTP randomize should return a random result, and if MTA1 is _not_ known to have (soft) failed just a moment ago, then technically the next attempt has a 50% chance of hitting MTA1 for a retry, right? I guess what I'm getting at is that if there's no marking of the equal-weighted MX peers' responses within the context of _THIS_ message's attempts, I can't really load-distribute internally without the chance of knocking on the same door twice. Which isn't a bad thing so much as me just wanting to really understand the line of execution.
Re: Outbound relayhost distribution
On Tue, Feb 22, 2011 at 07:29:23PM -0800, Robert Goodyear wrote: As I understand it, the RELAYHOST parameter will allow an FQDN that, when bracketed, can skip MX lookup and just return the DNS result. If I use roundrobin A records for my mesh of MTAs out in my datacenters, I've got a reasonable randomization going on. If I use MX records of equal weight, Postfix will do the randomization (assuming I've not disabled that in general with the SMTP randomize param elsewhere.) Randomization happens among all equal weight addresses. Multi-homed hosts are implicitly equal weight MX hosts, so they are also randomized. So, the downside of direct addressing of the roundrobin A record of my MTAs might be a lack of fallback that would have been described in my MX record. But for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge goes down, so let's ignore that for now. If you have enough equal-weight IPs, there is no need for lower-weight IPs. Just run all the servers hot-hot. With that assumption, it seems like MX versus roundrobin A addressing of my MTAs is pretty much equally performant Not pretty much, exactly as. save for the failover inherent in MX. No additional failover unless you want some hosts to receive mail only when it fails to deliver to others. I prefer hot-hot. -- Viktor.
Re: Outbound relayhost distribution
On 2/23/2011 1:58 PM, Victor Duchovni wrote: No additional failover unless you want some hosts to receive mail only when it fails to deliver to others. I prefer hot-hot. Just vocabulary question, what is hot-hot? -DB
Re: Outbound relayhost distribution
On 2/23/2011 12:50 PM, Robert Goodyear wrote: So: the message is ready to send. Postfix queries DNS for my smarthost entry and gets MTA1 = 10, MTA2 = 10. Postfix opens a connection to MTA1 which responds with a 'not now, too busy' response. Does Postfix hold the MX record in memory _for the duration of THIS message's delivery attempt_ knowing that MTA2 is next, or does Postfix re-query to look for another peer immediately? If it re-queries, and knowing that SMTP randomize should return a random result, and if MTA1 is _not_ known to have (soft) failed just a moment ago, then technically the next attempt has a 50% chance of hitting MTA1 for a retry, right? Postfix does a pretty good job of doing the right thing out of the box. On each delivery attempt, if the first MX fails, postfix will try the next MX. If all available MX's fail, the message goes to defer (or fallback_relay if configured). Next time the message enters the active queue, the process starts again. http://www.postfix.org/postconf.5.html#smtp_mx_address_limit http://www.postfix.org/postconf.5.html#smtp_mx_session_limit For the same destination, postfix will remember a dead host across multiple messages, and not retry a known dead host for a period of time. -- Noel Jones
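For reference, the two knobs Noel links control how many of the equal-weight MX addresses a single delivery attempt will walk through; you can check your own release's values like this (the defaults in the comment are the ones documented in postconf(5) for Postfix 2.x, quoted here as an assumption, not verified against every release):

```shell
# How many MX addresses may be tried per delivery attempt, and how many
# of those may turn into full SMTP sessions:
postconf -d smtp_mx_address_limit smtp_mx_session_limit
# Documented 2.x defaults: smtp_mx_address_limit = 5,
# smtp_mx_session_limit = 2
```

So with two equal-weight smarthosts, both fit comfortably within one attempt before the message is deferred or handed to a fallback relay.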
Re: Outbound relayhost distribution
On Wed, Feb 23, 2011 at 01:09:24PM -0600, Noel Jones wrote: For the same destination, postfix will remember a dead host across multiple messages, and not retry a known dead host for a period of time. No. Postfix remembers dead destinations, not dead hosts. When a live destination is served by hosts a subset of which are down, demand connection caching kicks in under load and reduces the frequency of (probabilistically slow) connection attempts. -- Viktor.
Re: Outbound relayhost distribution
On Wed, Feb 23, 2011 at 02:02:09PM -0500, Daniel Bromberg wrote: No additional failover unless you want some hosts to receive mail only when it fails to deliver to others. I prefer hot-hot. Just vocabulary question, what is hot-hot? A high-availability term, in which both sides of an HA cluster process requests at the same time, as opposed to hot-cold, where only one side is active at a time. -- Viktor.
Re: Outbound relayhost distribution
On Feb 23, 2011, at 11:17 AM, Victor Duchovni wrote: On Wed, Feb 23, 2011 at 02:02:09PM -0500, Daniel Bromberg wrote: No additional failover unless you want some hosts to receive mail only when it fails to deliver to others. I prefer hot-hot. Just vocabulary question, what is hot-hot? A high-availability term, in which both sides of an HA cluster process requests at the same time, as opposed to hot-cold, where only one side is active at a time. -- Viktor. Also known as ACTIVE/ACTIVE yes?
Re: Outbound relayhost distribution
On Feb 23, 2011, at 11:16 AM, Victor Duchovni wrote: On Wed, Feb 23, 2011 at 01:09:24PM -0600, Noel Jones wrote: For the same destination, postfix will remember a dead host across multiple messages, and not retry a known dead host for a period of time. No. Postfix remembers dead destinations, not dead hosts. When a live destination is served by hosts a subset of which are down, demand connection caching kicks in under load and reduces the frequency of (probabilistically slow) connection attempts. -- Viktor. I guess I'm most interested in what happens when a backoff response is sent back from my edge (relayhost/smarthost) MTA to my origin MTA. Since this is not a dead MX we're talking about, I'm trying to understand the negotiation and see if rapidly-expiring internal DNS TTL would actually _do_ anything or just shake up the randomization, which is pointless.
Re: Outbound relayhost distribution
On Wed, Feb 23, 2011 at 11:49:34AM -0800, Robert Goodyear wrote: Postfix remembers dead destinations, not dead hosts. When a live destination is served by hosts a subset of which are down, demand connection caching kicks in under load and reduces the frequency of (probabilistically slow) connection attempts. I guess I'm most interested in what happens when a backoff response is sent back from my edge (relayhost/smarthost) MTA to my origin MTA. I don't know what a backoff response means. There is no such term defined in the SMTP RFC. Since this is not a dead MX we're talking about, I'm trying to understand the negotiation and see if rapidly-expiring internal DNS TTL would actually _do_ anything or just shake up the randomization, which is pointless. What negotiation? What problem are you trying to solve? -- Viktor.
Re: Outbound relayhost distribution
On Feb 23, 2011, at 12:04 PM, Victor Duchovni wrote: On Wed, Feb 23, 2011 at 11:49:34AM -0800, Robert Goodyear wrote: Postfix remembers dead destinations, not dead hosts. When a live destination is served by hosts a subset of which are down, demand connection caching kicks in under load and reduces the frequency of (probabilistically slow) connection attempts. I guess I'm most interested in what happens when a backoff response is sent back from my edge (relayhost/smarthost) MTA to my origin MTA. I don't know what a backoff response means. There is no such term defined in the SMTP RFC. I'm sorry... I was speaking lazily there. I meant a 4.X.X response that would cause the message to requeue and follow a retry/backoff rate algorithm. Since this is not a dead MX we're talking about, I'm trying to understand the negotiation and see if rapidly-expiring internal DNS TTL would actually _do_ anything or just shake up the randomization, which is pointless. What negotiation? What problem are you trying to solve? Trying to load-share my edge MTAs that are relayhosts from my origin Postfix in a more scientific way than just hitting them at random, because when one becomes saturated, its weight in the probability of receiving another request is not reduced programmatically. By negotiation, I mean the SMTP session from my origin to the relay wherein it might get a 4.X.X -- if I can apply some logic that takes that one relay out of rotation for N minutes, that would be nice, because it would reduce chatter from subsequent retries and focus traffic on the other relays for a while. I realize that just adding relays to my topology in an equal-weighted priority gives a _similar_ result, but that's not deterministic.
Re: Outbound relayhost distribution
On Wed, Feb 23, 2011 at 02:19:28PM -0800, Robert Goodyear wrote: I'm sorry... I was speaking lazily there. I meant a 4.X.X response that would cause the message to requeue and follow a retry/backoff rate algorithm. Mere 4XX responses to MAIL FROM:, RCPT TO:, DATA or . don't impact active queue scheduling. Only responses that prematurely terminate the connection, or a 4XX banner or HELO response, trigger backoff (concurrency reduction and possible throttling of the destination after sufficiently many consecutive failures). What negotiation? What problem are you trying to solve? Trying to load-share my edge MTAs that are relayhosts from my origin Postfix in a more scientific way than just hitting them at random, because when one becomes saturated, its weight in the probability of receiving another request is not reduced programmatically. This is not necessary. The random loading is fine at low loads; under high loads Postfix connection caching is time-based rather than message-count based, which means that faster servers get a bigger share of the load. Some sites foolishly set limits on the number of deliveries per connection not just for suspect clients, but for all SMTP clients. They are doing themselves and everyone else a disservice. By negotiation, I mean the SMTP session from my origin to the relay wherein it might get a 4.X.X Not all 4XX responses are alike; tempfailing a message is normal, rejecting SMTP service is another matter. You still have not explained what real problem you're solving. Is this just premature optimization? -- if I can apply some logic that takes that one relay out of rotation for N minutes, that would be nice, because it would reduce chatter from subsequent retries and focus traffic on the other relays for a while. Have you seen problem relays in your upstream relay mix? What real symptoms do they exhibit and what is the observed impact on the upstream Postfix SMTP client? -- Viktor.
Re: Outbound relayhost distribution
On 2/22/2011 9:29 PM, Robert Goodyear wrote: I know this topic has been flogged to death, and perhaps for good reason, but I'm trying to determine the best outbound high-volume ecosystem for Postfix. As I understand it, the RELAYHOST parameter will allow an FQDN that, when bracketed, can skip MX lookup and just return the DNS result. If I use roundrobin A records for my mesh of MTAs out in my datacenters, I've got a reasonable randomization going on. If I use MX records of equal weight, Postfix will do the randomization (assuming I've not disabled that in general with the SMTP randomize param elsewhere.) I know that I could theoretically do some hijinx with a transport map, but that doesn't seem wise. So, the downside of direct addressing of the roundrobin A record of my MTAs might be a lack of fallback that would have been described in my MX record. But for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge goes down, so let's ignore that for now. With that assumption, it seems like MX versus roundrobin A addressing of my MTAs is pretty much equally performant, save for the failover inherent in MX. However, should I consider backoff/retry requests from these MTAs as a winning proposition for letting DNS and MX do their thing properly? If so, it would seem that roundrobin A addressing is therefore considered harmful. Is there a difference in TTLs or anything else that I should consider from the origin Postfix server which might weigh in here? Obviously, true load balancing would be the best option, perhaps by leveraging RabbitMQ and some realtime health metrics for each relay would need to affect the routing from Postfix to that relay, and then we're getting into a completely different architecture here. Using MX and SMTP negotiation the way it was intended is a lot simpler, but I'm just looking to optimize everything I can here. Thanks in advance for any clarification. 
Postfix will internally randomize either A records or equal-weight MX records, so it doesn't make too much difference which you use. A transport_maps entry that resolves to either multiple A records or multiple equal-weight MX records will perform about the same as a relayhost setting (assuming the normal case of negligible time spent on transport_maps lookup). The postfix connection caching algorithm will automatically limit the damage caused by a subset of slow-responding relayhosts. You can increase concurrency for relayhosts under your direct control if they can handle the load (it's impolite to open dozens/hundreds of connections to someone else's server without prior agreement). The default settings should give very good performance. For knobs to twist, please see: http://www.postfix.org/TUNING_README.html#mailing_tips http://www.postfix.org/QSHAPE_README.html http://www.postfix.org/QSHAPE_README.html#backlog -- Noel Jones
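In main.cf terms, the two interchangeable setups Noel describes look like this (edge.example.com is a hypothetical name for the relay farm; pick one form, not both):

```shell
# Option 1: bracketed relayhost -- skip the MX lookup and randomize
# across the name's A records.
postconf -e 'relayhost = [edge.example.com]'

# Option 2: unbracketed relayhost -- honor MX records; equal-weight MX
# hosts get the same randomization.
# postconf -e 'relayhost = edge.example.com'
```

Either way the delivery agent, not DNS round-robin, does the shuffling, which is why the two perform about the same.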
Re: Outbound relayhost distribution
On 2/22/2011 10:29 PM, Robert Goodyear wrote: I know this topic has been flogged to death, and perhaps for good reason, but I'm trying to determine the best outbound high-volume ecosystem for Postfix. As I understand it, the RELAYHOST parameter will allow an FQDN that, when bracketed, can skip MX lookup and just return the DNS result. If I use roundrobin A records for my mesh of MTAs out in my datacenters, I've got a reasonable randomization going on. If I use MX records of equal weight, Postfix will do the randomization (assuming I've not disabled that in general with the SMTP randomize param elsewhere.) I know that I could theoretically do some hijinx with a transport map, but that doesn't seem wise. So, the downside of direct addressing of the roundrobin A record of my MTAs might be a lack of fallback that would have been described in my MX record. But for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge goes down, so let's ignore that for now. With that assumption, it seems like MX versus roundrobin A addressing of my MTAs is pretty much equally performant, save for the failover inherent in MX. However, should I consider backoff/retry requests from these MTAs as a winning proposition for letting DNS and MX do their thing properly? If so, it would seem that roundrobin A addressing is therefore considered harmful. Is there a difference in TTLs or anything else that I should consider from the origin Postfix server which might weigh in here? Obviously, true load balancing would be the best option, perhaps by leveraging RabbitMQ and some realtime health metrics for each relay would need to affect the routing from Postfix to that relay, and then we're getting into a completely different architecture here. Using MX and SMTP negotiation the way it was intended is a lot simpler, but I'm just looking to optimize everything I can here. Thanks in advance for any clarification. 
I disclaim that there are some hackishnesses in my suggestion, but: You could leverage the built-in priority of MX records with a custom DNS load balancer. The postfix server could be configured to use as its exclusive downstream DNS source a rather fickle, private, exclusive DNS server set to serve its personal MX records with a TTL of, say, 1 minute, and otherwise behave normally for the public internet domain space. This private DNS server's zone file would be scriptable to query (SNMP? MySQL tables?) the queue lengths of your MTAs and translate these into MX priority values. If this is a good idea, it's probably been done before, and you can copy an existing technique. -Daniel
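Daniel's idea could be sketched roughly like this. Everything here is hypothetical: the hostnames, the 1-minute TTL, the "10 per 100 queued" preference step, and especially the polling step, which he deliberately leaves open (SNMP, a MySQL table, `postqueue -p` over ssh, ...):

```shell
# Map an edge MTA's queue depth to an MX preference: base 10, plus 10
# for every 100 queued messages (deeper queue => less-preferred MX).
mx_priority() {
  echo $((10 + $1 / 100 * 10))
}

# Read "host depth" pairs from whatever poller you use and emit a
# short-TTL zone fragment for the private DNS server to serve.
emit_zone() {
  while read -r host depth; do
    printf 'smarthost.example.com. 60 IN MX %s %s.\n' \
      "$(mx_priority "$depth")" "$host"
  done
}
```

A cron job (or the DNS server's own update hook) would regenerate and reload this fragment every minute to match the 1-minute TTL.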
Re: Outbound relayhost distribution
On Feb 22, 2011, at 9:06 PM, Noel Jones wrote: On 2/22/2011 9:29 PM, Robert Goodyear wrote: I know this topic has been flogged to death, and perhaps for good reason, but I'm trying to determine the best outbound high-volume ecosystem for Postfix. As I understand it, the RELAYHOST parameter will allow an FQDN that, when bracketed, can skip MX lookup and just return the DNS result. If I use roundrobin A records for my mesh of MTAs out in my datacenters, I've got a reasonable randomization going on. If I use MX records of equal weight, Postfix will do the randomization (assuming I've not disabled that in general with the SMTP randomize param elsewhere.) I know that I could theoretically do some hijinx with a transport map, but that doesn't seem wise. So, the downside of direct addressing of the roundrobin A record of my MTAs might be a lack of fallback that would have been described in my MX record. But for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge goes down, so let's ignore that for now. With that assumption, it seems like MX versus roundrobin A addressing of my MTAs is pretty much equally performant, save for the failover inherent in MX. However, should I consider backoff/retry requests from these MTAs as a winning proposition for letting DNS and MX do their thing properly? If so, it would seem that roundrobin A addressing is therefore considered harmful. Is there a difference in TTLs or anything else that I should consider from the origin Postfix server which might weigh in here? Obviously, true load balancing would be the best option, perhaps by leveraging RabbitMQ and some realtime health metrics for each relay would need to affect the routing from Postfix to that relay, and then we're getting into a completely different architecture here. Using MX and SMTP negotiation the way it was intended is a lot simpler, but I'm just looking to optimize everything I can here. Thanks in advance for any clarification. 
Postfix will internally randomize either A records or equal-weight MX records, so it doesn't make too much difference which you use. A transport_maps entry that resolves to either multiple A records or multiple equal-weight MX records will perform about the same as a relayhost setting (assuming the normal case of negligible time spent on transport_maps lookup). Thanks Noel, that's the validation I was looking for, w/r/t the overhead of either roundrobinning method by Postfix itself. The postfix connection caching algorithm will automatically limit the damage caused by a subset of slow-responding relayhosts. I suppose there's a tipping point of how much it would queue in memory versus paging it off to disk, right? Or would, for example, our friends at (name your favorite ISP here) decide to greylist us and just saturate the deferred queue such that the subset becomes the majority? In other words, is Postfix's algorithm written to prevent exclusive saturation by reserving some percentage of its allocated limits for !=(grouchyISP) You can increase concurrency for relayhosts under your direct control if they can handle the load (it's impolite to open dozens/hundreds of connections to someone else's server without prior agreement). Right, I've left the concurrency_limit alone as my bottleneck is from my edge MTAs to recipient domains. My goal here is to smooth out the lumps that collect at my edge MTAs because they are not (obviously) intelligently balanced, rather just load shared on the way in. I just had a thought, however... I wonder if I can mess with the backoff behavior of my edge MTAs to tell my origin server to cool it a bit in response to its (the edge's) workload? Will MX parity cause Postfix to hear the backoff request and move on to another equally-weighted server, or will it just defer the message and mark it as being destined for the exact same server that it handshook with and got the backoff request from? 
The default settings should give very good performance. For knobs to twist, please see: http://www.postfix.org/TUNING_README.html#mailing_tips http://www.postfix.org/QSHAPE_README.html http://www.postfix.org/QSHAPE_README.html#backlog -- Noel Jones Thanks for those links. It's easy to get bogged down in theory and not RTFM from the top again.