Re: Outbound relayhost distribution

2011-02-25 Thread Robert Goodyear

On Feb 23, 2011, at 8:25 PM, Victor Duchovni wrote:

 On Wed, Feb 23, 2011 at 02:19:28PM -0800, Robert Goodyear wrote:
 
 I'm sorry... I was speaking lazily there. I meant a 4.X.X response
 that would cause the message to requeue and follow a retry/backoff
 rate algorithm.
 
 Mere 4XX responses to MAIL FROM:, RCPT TO:, DATA or the end-of-data "." don't
 impact active queue scheduling. Only responses that prematurely terminate the
 connection, or a 4XX banner or HELO response, trigger backoff (concurrency
 reduction and possible throttling of the destination after sufficiently
 many consecutive failures).
 
 What negotiation? What problem are you trying to solve?
 
 
 Trying to load-share my edge MTAs that are relayhosts from my origin
 Postfix in a more scientific way than just hitting them at random,
 because when one becomes saturated, its weight in the probability of
 receiving another request is not reduced programmatically.
 
 This is not necessary. The random loading is fine at low loads; under
 high loads, Postfix connection caching is time-based rather than message-count
 based, which means that faster servers get a bigger share of the
 load. Some sites foolishly set limits on the number of deliveries per
 connection not just for suspect clients, but for all SMTP clients. They
 are doing themselves and everyone else a disservice.
 
 By negotiation, I mean the SMTP session from my origin to the relay
 wherein it might get a 4.X.X
 
 Not all 4XX responses are alike: tempfailing a message is normal,
 rejecting SMTP service is another matter. You still have not explained
 what real problem you're solving. Is this just premature optimization?
 
 -- if I can apply some logic that takes that
 one relay out of rotation for N minutes, that would be nice, because it
 would reduce chatter from subsequent retries and focus traffic on the
 other relays for a while.
 
 Have you seen problem relays in your upstream relay mix? What real
 symptoms do they exhibit and what is the observed impact on the upstream
 Postfix SMTP client?

I'm going to run some analytics on my last 12 months' worth of outbound 
messages to get more scientific with my gut instincts here. It's about 270 
million messages, and my observation is that when we have a spike of 4 or 5 
million that need to deliver at a certain point in time (surrounding a 
critical/time-sensitive product launch) that my deferred queues saturate too 
quickly. 

Again, rather than just brute-force it with more edge MTAs, I was hoping to 
devise a more deterministic way to control the internal relaying to my 
geographically-separated points of presence and shave off the few ms of 
conversation that are consumed in finding out if relay X will accept more 
messages yet.




Re: Outbound relayhost distribution

2011-02-25 Thread Victor Duchovni
On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote:

  Have you seen problem relays in your upstream relay mix? What real
  symptoms do they exhibit and what is the observed impact on the upstream
  Postfix SMTP client?
 
 I'm going to run some analytics on my last 12 months' worth of outbound
 messages to get more scientific with my gut instincts here. It's about 270
 million messages, and my observation is that when we have a spike of 4 or
 5 million that need to deliver at a certain point in time (surrounding a
 critical/time-sensitive product launch) that my deferred queues saturate
 too quickly.

20 million a month is a moderate mail flow if it is mail from ~50-100K
users spread out over the day. I would then expect no more than ~1K
messages in the deferred queue on each of ~4 machines to be about the right
quantity of deferred email.
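
(For scale: 270 million messages over 12 months works out to roughly 22-23
million a month, or on the order of 8-9 messages per second averaged over a
full day.)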

4 million messages to deliver all at once is a very different problem.

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-25 Thread Noel Jones

On 2/25/2011 4:38 PM, Robert Goodyear wrote:



I'm going to run some analytics on my last 12 months' worth of outbound 
messages to get more scientific with my gut instincts here. It's about 270 
million messages, and my observation is that when we have a spike of 4 or 5 
million that need to deliver at a certain point in time (surrounding a 
critical/time-sensitive product launch) that my deferred queues saturate too 
quickly.

Again, rather than just brute-force it with more edge MTAs, I was hoping to 
devise a more deterministic way to control the internal relaying to my 
geographically-separated points of presence and shave off the few ms of 
conversation that are consumed in finding out if relay X will accept more 
messages yet.



Standard advice for this problem is to designate an internal 
fallback_relay (which can be a whole second MX farm) to handle 
mail that isn't delivered quickly.  That way the primary 
outbound machines aren't bogged down with a clogged defer 
queue.  I think this is discussed in TUNING_README.
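
For illustration only, a minimal main.cf sketch of that arrangement might look 
like the following (hostnames are hypothetical; on Postfix 2.3 and later the 
parameter is smtp_fallback_relay, older releases call it fallback_relay):

    # primary outbound machines: hand anything not delivered promptly
    # to a dedicated internal fallback farm (hypothetical hosts)
    smtp_fallback_relay = [fallback1.example.internal], [fallback2.example.internal]

The fallback machines then carry the long-tail retries, so the primary 
machines' active queues stay free for fresh mail.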


But you're wise to analyze prior data and determine exactly 
where the bottleneck is before wholesale restructuring.



  -- Noel Jones


Re: Outbound relayhost distribution

2011-02-25 Thread Robert Goodyear

On Feb 25, 2011, at 2:58 PM, Victor Duchovni wrote:

 On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote:
 
 Have you seen problem relays in your upstream relay mix? What real
 symptoms do they exhibit and what is the observed impact on the upstream
 Postfix SMTP client?
 
 I'm going to run some analytics on my last 12 months' worth of outbound
 messages to get more scientific with my gut instincts here. It's about 270
 million messages, and my observation is that when we have a spike of 4 or
 5 million that need to deliver at a certain point in time (surrounding a
 critical/time-sensitive product launch) that my deferred queues saturate
 too quickly.
 
 20 million a month is a moderate mail flow if it is mail from ~50-100K
 users spread out over the day. I would then expect no more than ~1K
 messages in the deferred queue on each of ~4 machines to be about the right
 quantity of deferred email.
 
 4 million messages to deliver all at once is a very different problem.

It is definitely a lumpy distribution -- probably 2 to 3 per month of ~4-5 
million to North American subscribers, interspersed with smaller regional 
(outside North America) campaigns of 250-300K that sometimes coincide with one 
of the big campaigns. Of course I could start building stovepipes in my 
topology to isolate activity so one doesn't affect the other, but then 
conversely I might have cold MTAs sitting idle when I could be using them. I 
*do* have some regional points of presence where I have MTAs close to the 
subscribers for their markets, e.g.: UK, EU and SE Asia; maybe I should 
experiment with offloading deferred North America queues to them. I wonder if 
their inherent latency would act as a rate limiter of sorts that would play 
more nicely with recipient domains?

Anyway I'm speculating... let me go crazy with SPSS and look for some absolute 
patterns in the last year here.




Re: Outbound relayhost distribution

2011-02-25 Thread fakessh @
the quantity of deferred is yahoo response : this as that that is this
On Friday, 25 February 2011 at 15:29 -0800, Robert Goodyear wrote:
 On Feb 25, 2011, at 2:58 PM, Victor Duchovni wrote:
 
  On Fri, Feb 25, 2011 at 02:38:16PM -0800, Robert Goodyear wrote:
  
  Have you seen problem relays in your upstream relay mix? What real
  symptoms do they exhibit and what is the observed impact on the upstream
  Postfix SMTP client?
  
  I'm going to run some analytics on my last 12 months' worth of outbound
  messages to get more scientific with my gut instincts here. It's about 270
  million messages, and my observation is that when we have a spike of 4 or
  5 million that need to deliver at a certain point in time (surrounding a
  critical/time-sensitive product launch) that my deferred queues saturate
  too quickly.
  
  20 million a month is a moderate mail flow if it is mail from ~50-100K
  users spread out over the day. I would then expect no more than ~1K
  messages in the deferred queue on each of ~4 machines to be about the right
  quantity of deferred email.
  
  4 million messages to deliver all at once is a very different problem.
 
 It is definitely a lumpy distribution -- probably 2 to 3 per month of ~4-5 
 million to North American subscribers, interspersed with smaller regional 
 (outside North America) campaigns of 250-300K that sometimes coincide with 
 one of the big campaigns. Of course I could start building stovepipes in my 
 topology to isolate activity so one doesn't affect the other, but then 
 conversely I might have cold MTAs sitting idle when I could be using them. I 
 *do* have some regional points of presence where I have MTAs close to the 
 subscribers for their markets, e.g.: UK, EU and SE Asia; maybe I should 
 experiment with offloading deferred North America queues to them. I wonder if 
 their inherent latency would act as a rate limiter of sorts that would play 
 more nicely with recipient domains?
 
 Anyway I'm speculating... let me go crazy with SPSS and look for some 
 absolute patterns in the last year here.
 
 
-- 
gpg --keyserver pgp.mit.edu --recv-key 092164A7
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x092164A7




Re: Outbound relayhost distribution

2011-02-23 Thread Noel Jones

On 2/22/2011 11:44 PM, Robert Goodyear wrote:


On Feb 22, 2011, at 9:06 PM, Noel Jones wrote:


On 2/22/2011 9:29 PM, Robert Goodyear wrote:



The postfix connection caching algorithm will automatically limit the damage 
caused by a subset of slow-responding relayhosts.


I suppose there's a tipping point of how much it would queue in memory versus 
paging it off to disk, right? Or would, for example, our friends at (name your 
favorite ISP here) decide to greylist us and just saturate the deferred queue 
such that the subset becomes the majority? In other words, is Postfix's 
algorithm written to prevent exclusive saturation by reserving some percentage 
of its allocated limits for !=(grouchyISP)?


That was referring to mail for a single destination, such as a 
relayhost farm.  For general high-volume internet delivery, 
it's generally recommended to have an internal fallback_relay 
or two that collects mail that can't be delivered right away. 
This keeps the defer queue low and keeps slow/dead 
destinations from hogging the active queue.  This is covered in 
the docs referenced earlier.



I just had a thought, however... I wonder if I can mess with the backoff 
behavior of my edge MTAs to tell my origin server to cool it a bit in response 
to its (the edge's) workload? Will MX parity cause Postfix to hear the backoff 
request and move on to another equally-weighted server, or will it just defer 
the message and mark it as being destined for the exact same server that it 
handshook with and got the backoff request from?


Postfix does not remember the last host tried for a 
destination, and looks up the MX each time the message is 
scheduled for delivery.  Most of the grouchy ISPs have some 
sort of registration program for high-volume senders so you're 
not throttled or blacklisted.







The default settings should give very good performance.  For knobs to twist, 
please see:
http://www.postfix.org/TUNING_README.html#mailing_tips
http://www.postfix.org/QSHAPE_README.html
http://www.postfix.org/QSHAPE_README.html#backlog


  -- Noel Jones


Thanks for those links. It's easy to get bogged down in theory and not RTFM 
from the top again.




Re: Outbound relayhost distribution

2011-02-23 Thread Robert Goodyear

On Feb 23, 2011, at 4:27 AM, Noel Jones wrote:

 On 2/22/2011 11:44 PM, Robert Goodyear wrote:
 
 On Feb 22, 2011, at 9:06 PM, Noel Jones wrote:
 
 On 2/22/2011 9:29 PM, Robert Goodyear wrote:
 
 The postfix connection caching algorithm will automatically limit the 
 damage caused by a subset of slow-responding relayhosts.
 
 I suppose there's a tipping point of how much it would queue in memory 
 versus paging it off to disk, right? Or would, for example, our friends at 
 (name your favorite ISP here) decide to greylist us and just saturate the 
 deferred queue such that the subset becomes the majority? In other words, is 
 Postfix's algorithm written to prevent exclusive saturation by reserving 
 some percentage of its allocated limits for !=(grouchyISP)?
 
 That was referring to mail for a single destination, such as a relayhost 
 farm.  For general high-volume internet delivery, it's generally recommended 
 to have an internal fallback_relay or two that collects mail that can't be 
 delivered right away.  This keeps the defer queue low and keeps slow/dead 
 destinations from hogging the active queue.  This is covered in the docs 
 referenced earlier.

Ah right... I'll add a couple more fallbacks. I'm only running one behind six 
edge MTAs so I should increase the ratio as I add more edge MTAs.


 
 I just had a thought, however... I wonder if I can mess with the backoff 
 behavior of my edge MTAs to tell my origin server to cool it a bit in 
 response to its (the edge's) workload? Will MX parity cause Postfix to hear 
 the backoff request and move on to another equally-weighted server, or will 
 it just defer the message and mark it as being destined for the exact same 
 server that it handshook with and got the backoff request from?
 
 Postfix does not remember the last host tried for a destination, and looks up 
 the MX each time the message is scheduled for delivery.  Most of the grouchy 
 ISPs have some sort of registration program for high-volume senders so you're 
 not throttled or blacklisted.

Oh trust me, I have one person on my team who spends most of his time 
maintaining unsubscribes, blacklists, feedback loops, whitelist requests, and RBL 
pleadings (it's amazing how many RBLs spring up out there because someone 
thinks they've invented a better mousetrap. Unfortunately it only impacts 
legitimate senders and costs us time and money, whereas the spammers are 
unperturbed and just stay one step ahead).

So: the message is ready to send. Postfix queries DNS for my smarthost entry 
and gets MTA1 = 10, MTA2 = 10. Postfix opens a connection to MTA1 which 
responds with a 'not now, too busy' response. Does Postfix hold the MX record 
in memory _for the duration of THIS message's delivery attempt_ knowing that 
MTA2 is next, or does Postfix re-query to look for another peer immediately? If 
it re-queries, and knowing that SMTP randomize should return a random result, 
and if MTA1 is _not_ known to have (soft) failed just a moment ago, then 
technically the next attempt has a 50% chance of hitting MTA1 for a retry, 
right?

I guess what I'm getting at is that if there's no marking of the equal-weighted 
MX peers' responses within the context of _THIS_ message's attempts, I can't 
really load-distribute internally without the chance of knocking on the same 
door twice. Which isn't a bad thing so much as me just wanting to really 
understand the line of execution.



Re: Outbound relayhost distribution

2011-02-23 Thread Victor Duchovni
On Tue, Feb 22, 2011 at 07:29:23PM -0800, Robert Goodyear wrote:

 As I understand it, the RELAYHOST parameter will allow an FQDN that,
 when bracketed, can skip MX lookup and just return the DNS result. If
 I use roundrobin A records for my mesh of MTAs out in my datacenters,
 I've got a reasonable randomization going on. If I use MX records of
 equal weight, Postfix will do the randomization (assuming I've not
 disabled that in general with the SMTP randomize param elsewhere.)

Randomization happens among all equal-weight addresses. Multi-homed
hosts are implicitly equal-weight MX hosts, so they are also randomized.
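
As a concrete sketch (hostnames hypothetical), the two setups being compared 
would be:

    # equal-weight MX records for smarthost.example.internal -> mta1, mta2, ...
    relayhost = smarthost.example.internal

    # round-robin / multi-homed A record; the brackets suppress the MX lookup
    relayhost = [smarthost-pool.example.internal]

Either way the SMTP client shuffles the resulting equal-weight addresses 
before connecting.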

 So, the downside of direct addressing of the roundrobin A record of
 my MTAs might be a lack of fallback that would have been described in
 my MX record. But for HA outbound stuff, I've got bigger problems if
 one of my MTAs on the edge goes down, so let's ignore that for now.

If you have enough equal-weight IPs, there is no need for lower-weight
IPs. Just run all the servers hot-hot.

 With that assumption, it seems like MX versus roundrobin A addressing
 of my MTAs is pretty much equally performant

Not pretty much, exactly as.

 save for the failover inherent in MX.

No additional failover unless you want some hosts to receive mail
only when it fails to deliver to others. I prefer hot-hot.

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-23 Thread Daniel Bromberg

On 2/23/2011 1:58 PM, Victor Duchovni wrote:

No additional failover unless you want some hosts to receive mail
only when it fails to deliver to others. I prefer hot-hot.

Just vocabulary question, what is hot-hot?

-DB



Re: Outbound relayhost distribution

2011-02-23 Thread Noel Jones

On 2/23/2011 12:50 PM, Robert Goodyear wrote:


So: the message is ready to send. Postfix queries DNS for my smarthost entry 
and gets MTA1 = 10, MTA2 = 10. Postfix opens a connection to MTA1 which 
responds with a 'not now, too busy' response. Does Postfix hold the MX record 
in memory _for the duration of THIS message's delivery attempt_ knowing that 
MTA2 is next, or does Postfix re-query to look for another peer immediately? If 
it re-queries, and knowing that SMTP randomize should return a random result, 
and if MTA1 is _not_ known to have (soft) failed just a moment ago, then 
technically the next attempt has a 50% chance of hitting MTA1 for a retry, 
right?



Postfix does a pretty good job of doing the right thing out 
of the box.


On each delivery attempt, if the first MX fails, postfix will 
try the next MX.


If all available MX's fail, the message goes to defer (or 
fallback_relay if configured).  Next time the message enters 
the active queue, the process starts again.

http://www.postfix.org/postconf.5.html#smtp_mx_address_limit
http://www.postfix.org/postconf.5.html#smtp_mx_session_limit
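
For reference, those two parameters bound how many MX addresses a single 
delivery attempt will consider and how many it will actually try; the defaults 
on current 2.x releases are roughly:

    smtp_mx_address_limit = 5    # MX addresses considered per delivery
    smtp_mx_session_limit = 2    # SMTP sessions tried per delivery attempt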

For the same destination, postfix will remember a dead host 
across multiple messages, and not retry a known dead host for 
a period of time.




  -- Noel Jones


Re: Outbound relayhost distribution

2011-02-23 Thread Victor Duchovni
On Wed, Feb 23, 2011 at 01:09:24PM -0600, Noel Jones wrote:

 For the same destination, postfix will remember a dead host across multiple 
 messages, and not retry a known dead host for a period of time.

No. Postfix remembers dead destinations, not dead hosts. When a live
destination is served by hosts a subset of which are down, demand
connection caching kicks in under load and reduces the frequency
of (probabilistically slow) connection attempts.
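
The knobs behind that behaviour, with their stock values (shown for reference; 
they rarely need changing):

    smtp_connection_cache_on_demand = yes   # cache connections when a destination backs up
    smtp_connection_cache_time_limit = 2s   # how long an idle cached connection is kept
    smtp_connection_reuse_time_limit = 300s # how long one connection may keep being reused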

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-23 Thread Victor Duchovni
On Wed, Feb 23, 2011 at 02:02:09PM -0500, Daniel Bromberg wrote:

 No additional failover unless you want some hosts to receive mail
 only when it fails to deliver to others. I prefer hot-hot.

 Just vocabulary question, what is hot-hot?

A high-availability term, in which both sides of an HA cluster process
requests at the same time, as opposed to hot-cold, where only one side
is active at a time.

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-23 Thread Robert Goodyear

On Feb 23, 2011, at 11:17 AM, Victor Duchovni wrote:

 On Wed, Feb 23, 2011 at 02:02:09PM -0500, Daniel Bromberg wrote:
 
 No additional failover unless you want some hosts to receive mail
 only when it fails to deliver to others. I prefer hot-hot.
 
 Just vocabulary question, what is hot-hot?
 
 A high-availability term, in which both sides of an HA cluster process
 requests at the same time, as opposed to hot-cold, where only one side
 is active at a time.
 
 -- 
   Viktor.

Also known as ACTIVE/ACTIVE yes?


Re: Outbound relayhost distribution

2011-02-23 Thread Robert Goodyear

On Feb 23, 2011, at 11:16 AM, Victor Duchovni wrote:

 On Wed, Feb 23, 2011 at 01:09:24PM -0600, Noel Jones wrote:
 
 For the same destination, postfix will remember a dead host across multiple 
 messages, and not retry a known dead host for a period of time.
 
 No. Postfix remembers dead destinations, not dead hosts. When a live
 destination is served by hosts a subset of which are down, demand
 connection caching kicks in under load and reduces the frequency
 of (probabilistically slow) connection attempts.
 
 -- 
   Viktor.

I guess I'm most interested in what happens when a backoff response is sent 
back from my edge (relayhost/smarthost) MTA to my origin MTA. Since this is not 
a dead MX we're talking about, I'm trying to understand the negotiation and see 
if rapidly-expiring internal DNS TTL would actually _do_ anything or just shake 
up the randomization, which is pointless.

Re: Outbound relayhost distribution

2011-02-23 Thread Victor Duchovni
On Wed, Feb 23, 2011 at 11:49:34AM -0800, Robert Goodyear wrote:

  Postfix remembers dead destinations, not dead hosts. When a live
  destination is served by hosts a subset of which are down, demand
  connection caching kicks in under load and reduces the frequency
  of (probabilistically slow) connection attempts.
 
 I guess I'm most interested in what happens when a backoff response is
 sent back from my edge (relayhost/smarthost) MTA to my origin MTA.

I don't know what a backoff response means. There is no such term
defined in the SMTP RFC.

 Since this is not a dead MX we're talking about, I'm trying to understand
 the negotiation and see if rapidly-expiring internal DNS TTL would actually
 _do_ anything or just shake up the randomization, which is pointless.

What negotiation? What problem are you trying to solve?

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-23 Thread Robert Goodyear

On Feb 23, 2011, at 12:04 PM, Victor Duchovni wrote:

 On Wed, Feb 23, 2011 at 11:49:34AM -0800, Robert Goodyear wrote:
 
 Postfix remembers dead destinations, not dead hosts. When a live
 destination is served by hosts a subset of which are down, demand
 connection caching kicks in under load and reduces the frequency
 of (probabilistically slow) connection attempts.
 
 I guess I'm most interested in what happens when a backoff response is
 sent back from my edge (relayhost/smarthost) MTA to my origin MTA.
 
 I don't know what a backoff response means. There is no such term
 defined in the SMTP RFC.

I'm sorry... I was speaking lazily there. I meant a 4.X.X response that would 
cause the message to requeue and follow a retry/backoff rate algorithm.


 
 Since this is not a dead MX we're talking about, I'm trying to understand
 the negotiation and see if rapidly-expiring internal DNS TTL would actually
 _do_ anything or just shake up the randomization, which is pointless.
 
 What negotiation? What problem are you trying to solve?


Trying to load-share my edge MTAs that are relayhosts from my origin Postfix in 
a more scientific way than just hitting them at random, because when one 
becomes saturated, its weight in the probability of receiving another request 
is not reduced programmatically. 

By negotiation, I mean the SMTP session from my origin to the relay wherein it 
might get a 4.X.X -- if I can apply some logic that takes that one relay out of 
rotation for N minutes, that would be nice, because it would reduce chatter 
from subsequent retries and focus traffic on the other relays for a while.

I realize that just adding relays to my topology in an equal-weighted priority 
gives a _similar_ result, but that's not deterministic. 

Re: Outbound relayhost distribution

2011-02-23 Thread Victor Duchovni
On Wed, Feb 23, 2011 at 02:19:28PM -0800, Robert Goodyear wrote:

 I'm sorry... I was speaking lazily there. I meant a 4.X.X response
 that would cause the message to requeue and follow a retry/backoff
 rate algorithm.

Mere 4XX responses to MAIL FROM:, RCPT TO:, DATA or the end-of-data "." don't
impact active queue scheduling. Only responses that prematurely terminate the
connection, or a 4XX banner or HELO response, trigger backoff (concurrency
reduction and possible throttling of the destination after sufficiently
many consecutive failures).
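
For reference, the concurrency feedback described above is governed on Postfix 
2.5 and later by these parameters (defaults shown):

    default_destination_concurrency_failed_cohort_limit = 1
    default_destination_concurrency_negative_feedback = 1
    default_destination_concurrency_positive_feedback = 1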

  What negotiation? What problem are you trying to solve?
 
 
 Trying to load-share my edge MTAs that are relayhosts from my origin
 Postfix in a more scientific way than just hitting them at random,
 because when one becomes saturated, its weight in the probability of
 receiving another request is not reduced programmatically.

This is not necessary. The random loading is fine at low loads; under
high loads, Postfix connection caching is time-based rather than message-count
based, which means that faster servers get a bigger share of the
load. Some sites foolishly set limits on the number of deliveries per
connection not just for suspect clients, but for all SMTP clients. They
are doing themselves and everyone else a disservice.

 By negotiation, I mean the SMTP session from my origin to the relay
 wherein it might get a 4.X.X

Not all 4XX responses are alike: tempfailing a message is normal,
rejecting SMTP service is another matter. You still have not explained
what real problem you're solving. Is this just premature optimization?

 -- if I can apply some logic that takes that
 one relay out of rotation for N minutes, that would be nice, because it
 would reduce chatter from subsequent retries and focus traffic on the
 other relays for a while.

Have you seen problem relays in your upstream relay mix? What real
symptoms do they exhibit and what is the observed impact on the upstream
Postfix SMTP client?

-- 
Viktor.


Re: Outbound relayhost distribution

2011-02-22 Thread Noel Jones

On 2/22/2011 9:29 PM, Robert Goodyear wrote:

I know this topic has been flogged to death, and perhaps for good reason, but 
I'm trying to determine the best outbound high-volume ecosystem for Postfix.

As I understand it, the RELAYHOST parameter will allow an FQDN that, when 
bracketed, can skip MX lookup and just return the DNS result. If I use 
roundrobin A records for my mesh of MTAs out in my datacenters, I've got a 
reasonable randomization going on. If I use MX records of equal weight, Postfix 
will do the randomization (assuming I've not disabled that in general with the 
SMTP randomize param elsewhere.)

I know that I could theoretically do some hijinx with a transport map, but that 
doesn't seem wise.

So, the downside of direct addressing of the roundrobin A record of my MTAs 
might be a lack of fallback that would have been described in my MX record. But 
for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge 
goes down, so let's ignore that for now.

With that assumption, it seems like MX versus roundrobin A addressing of my 
MTAs is pretty much equally performant, save for the failover inherent in MX. 
However, should I consider backoff/retry requests from these MTAs as a winning 
proposition for letting DNS and MX do their thing properly? If so, it would 
seem that roundrobin A addressing is therefore considered harmful. Is there a 
difference in TTLs or anything else that I should consider from the origin 
Postfix server which might weigh in here?

Obviously, true load balancing would be the best option, perhaps by leveraging 
RabbitMQ; some realtime health metrics for each relay would then need to affect 
the routing from Postfix to that relay, and then we're getting into a 
completely different architecture here. Using MX and SMTP negotiation the way 
it was intended is a lot simpler, but I'm just looking to optimize everything I 
can here.

Thanks in advance for any clarification.



Postfix will internally randomize either A records or 
equal-weight MX records, so it doesn't make too much 
difference which you use.  A transport_maps entry that 
resolves to either multiple A records or multiple equal-weight 
MX records will perform about the same as a relayhost setting 
(assuming the normal case of negligible time spent on 
transport_maps lookup).
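
A sketch of the two equivalent arrangements, with hypothetical names:

    # main.cf: relayhost pointing at an equal-weight MX / multi-A name
    relayhost = smarthost.example.internal

    # or, for a given destination domain, a transport_maps entry
    # resolving to the same pool (entry in the transport file):
    example.com    smtp:smarthost.example.internal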


The postfix connection caching algorithm will automatically 
limit the damage caused by a subset of slow-responding relayhosts.


You can increase concurrency for relayhosts under your direct 
control if they can handle the load (it's impolite to open 
dozens/hundreds of connections to someone else's server 
without prior agreement).
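
If the pool is entirely under your own control, the relevant knob is the 
per-destination concurrency limit, e.g. (value purely illustrative):

    # defaults to $default_destination_concurrency_limit, i.e. 20
    smtp_destination_concurrency_limit = 50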


The default settings should give very good performance.  For 
knobs to twist, please see:

http://www.postfix.org/TUNING_README.html#mailing_tips
http://www.postfix.org/QSHAPE_README.html
http://www.postfix.org/QSHAPE_README.html#backlog
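
The QSHAPE_README boils down to running the qshape tool (shipped with the 
Postfix source under auxiliary/qshape) against a queue, e.g.:

    qshape deferred

which prints a recipient-domain by message-age table and makes a backlog 
toward one grouchy destination obvious.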


  -- Noel Jones


Re: Outbound relayhost distribution

2011-02-22 Thread Daniel Bromberg

On 2/22/2011 10:29 PM, Robert Goodyear wrote:

I know this topic has been flogged to death, and perhaps for good reason, but 
I'm trying to determine the best outbound high-volume ecosystem for Postfix.

As I understand it, the RELAYHOST parameter will allow an FQDN that, when 
bracketed, can skip MX lookup and just return the DNS result. If I use 
roundrobin A records for my mesh of MTAs out in my datacenters, I've got a 
reasonable randomization going on. If I use MX records of equal weight, Postfix 
will do the randomization (assuming I've not disabled that in general with the 
SMTP randomize param elsewhere.)

I know that I could theoretically do some hijinx with a transport map, but that 
doesn't seem wise.

So, the downside of direct addressing of the roundrobin A record of my MTAs 
might be a lack of fallback that would have been described in my MX record. But 
for HA outbound stuff, I've got bigger problems if one of my MTAs on the edge 
goes down, so let's ignore that for now.

With that assumption, it seems like MX versus roundrobin A addressing of my 
MTAs is pretty much equally performant, save for the failover inherent in MX. 
However, should I consider backoff/retry requests from these MTAs as a winning 
proposition for letting DNS and MX do their thing properly? If so, it would 
seem that roundrobin A addressing is therefore considered harmful. Is there a 
difference in TTLs or anything else that I should consider from the origin 
Postfix server which might weigh in here?

Obviously, true load balancing would be the best option, perhaps by leveraging 
RabbitMQ; some realtime health metrics for each relay would then need to affect 
the routing from Postfix to that relay, and then we're getting into a 
completely different architecture here. Using MX and SMTP negotiation the way 
it was intended is a lot simpler, but I'm just looking to optimize everything I 
can here.

Thanks in advance for any clarification.

I disclaim that there are some hackishnesses in my suggestion, but:

You could leverage the built-in priority of MX records with a custom DNS 
load balancer.  The postfix server could be configured to use as its 
exclusive downstream DNS source a rather fickle, private, exclusive, DNS 
server set to serve its personal MX records with a TTL of, say, 1 
minute, and otherwise behave normally for the public internet domain 
space. This private DNS server's zone file would be scriptable to query 
(SNMP? MySQL tables?) the queue lengths of your MTAs and translate 
these into MX priority values.  If this is a good idea, it's probably 
been done before, and you can copy an existing technique.
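
A sketch of what such a generated zone might contain (names and values 
hypothetical), with a short TTL so priority changes take effect quickly:

    $TTL 60
    smarthost.example.internal.  IN  MX  10  mta1.example.internal.
    smarthost.example.internal.  IN  MX  10  mta2.example.internal.
    smarthost.example.internal.  IN  MX  20  mta3.example.internal.  ; currently backlogged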


-Daniel



Re: Outbound relayhost distribution

2011-02-22 Thread Robert Goodyear

On Feb 22, 2011, at 9:06 PM, Noel Jones wrote:

 On 2/22/2011 9:29 PM, Robert Goodyear wrote:
 I know this topic has been flogged to death, and perhaps for good reason, 
 but I'm trying to determine the best outbound high-volume ecosystem for 
 Postfix.
 
 As I understand it, the RELAYHOST parameter will allow an FQDN that, when 
 bracketed, can skip MX lookup and just return the DNS result. If I use 
 roundrobin A records for my mesh of MTAs out in my datacenters, I've got a 
 reasonable randomization going on. If I use MX records of equal weight, 
 Postfix will do the randomization (assuming I've not disabled that in 
 general with the SMTP randomize param elsewhere.)
 
 I know that I could theoretically do some hijinx with a transport map, but 
 that doesn't seem wise.
 
 So, the downside of direct addressing of the roundrobin A record of my MTAs 
 might be a lack of fallback that would have been described in my MX record. 
 But for HA outbound stuff, I've got bigger problems if one of my MTAs on the 
 edge goes down, so let's ignore that for now.
 
 With that assumption, it seems like MX versus roundrobin A addressing of my 
 MTAs is pretty much equally performant, save for the failover inherent in 
 MX. However, should I consider backoff/retry requests from these MTAs as a 
 winning proposition for letting DNS and MX do their thing properly? If so, 
 it would seem that roundrobin A addressing is therefore considered harmful. 
 Is there a difference in TTLs or anything else that I should consider from 
 the origin Postfix server which might weigh in here?
 
 Obviously, true load balancing would be the best option, perhaps by leveraging 
 RabbitMQ; some realtime health metrics for each relay would then need to 
 affect the routing from Postfix to that relay, and then we're getting into a 
 completely different architecture here. Using MX and SMTP negotiation the way 
 it was intended is a lot simpler, but I'm just looking to optimize everything 
 I can here.
 
 Thanks in advance for any clarification.
 
 
 Postfix will internally randomize either A records or equal-weight MX 
 records, so it doesn't make too much difference which you use.  A 
 transport_maps entry that resolves to either multiple A records or multiple 
 equal-weight MX records will perform about the same as a relayhost setting 
 (assuming the normal case of negligible time spent on transport_maps lookup).

Thanks Noel, that's the validation I was looking for, w/r/t the overhead of 
either roundrobinning method by Postfix itself. 

 The postfix connection caching algorithm will automatically limit the damage 
 caused by a subset of slow-responding relayhosts.

I suppose there's a tipping point of how much it would queue in memory versus 
paging it off to disk, right? Or would, for example, our friends at (name your 
favorite ISP here) decide to greylist us and just saturate the deferred queue 
such that the subset becomes the majority? In other words, is Postfix's 
algorithm written to prevent exclusive saturation by reserving some percentage 
of its allocated limits for !=(grouchyISP)?

 You can increase concurrency for relayhosts under your direct control if they 
 can handle the load (it's impolite to open dozens/hundreds of connections to 
 someone else's server without prior agreement).

Right, I've left the concurrency_limit alone as my bottleneck is from my edge 
MTAs to recipient domains. My goal here is to smooth out the lumps that collect 
at my edge MTAs because they are not (obviously) intelligently balanced, rather 
just load shared on the way in. 

I just had a thought, however... I wonder if I can mess with the backoff 
behavior of my edge MTAs to tell my origin server to cool it a bit in response 
to its (the edge's) workload? Will MX parity cause Postfix to hear the backoff 
request and move on to another equally-weighted server, or will it just defer 
the message and mark it as being destined for the exact same server that it 
handshook with and got the backoff request from?

 
 The default settings should give very good performance.  For knobs to twist, 
 please see:
 http://www.postfix.org/TUNING_README.html#mailing_tips
 http://www.postfix.org/QSHAPE_README.html
 http://www.postfix.org/QSHAPE_README.html#backlog
 
 
  -- Noel Jones

Thanks for those links. It's easy to get bogged down in theory and not RTFM 
from the top again.