Re: Segfaults libpcre in cleanup

2020-05-15 Thread Wietse Venema
Tobi:
> Wietse,
> 
> Am 14.05.20 um 16:04 schrieb Wietse Venema:
> > So your best bet is to rewrite your patterns.
> 
> that's what we did now and it saves a lot of cpu cycles :-) As our
> pattern matched something at beginning and at the end of the string (all
> stuff between matched via patterns with * repetition) the easiest thing
> was to add a condition to the map file for the first part and in the if
> match the end of the string. That way we could get rid of a lot of the *
> between start and end :-)
> 
> > if /^Content-(Disposition|Type).*name\s*=\s*/i

The 'i' flag makes the pattern case sensitive. You should
leave it at the default (case insensitive).

Wietse

> > /(\.|=2E)(dmg|cab)\s*(;|$)/ REJECT extension of .$2 is not allowed
> > endif
> 
> the pattern above now runs without any changes to stack size :-)
> 
> --
> Cheers
> 
> tobi
> 
> 


Re: Segfaults libpcre in cleanup

2020-05-15 Thread Tobi
Wietse,

Am 14.05.20 um 16:04 schrieb Wietse Venema:
> So your best bet is to rewrite your patterns.

that's what we did now and it saves a lot of cpu cycles :-) As our
pattern matched something at beginning and at the end of the string (all
stuff between matched via patterns with * repetition) the easiest thing
was to add a condition to the map file for the first part and in the if
match the end of the string. That way we could get rid of a lot of the *
between start and end :-)

> if /^Content-(Disposition|Type).*name\s*=\s*/i
> /(\.|=2E)(dmg|cab)\s*(;|$)/   REJECT extension of .$2 is not allowed
> endif

the pattern above now runs without any changes to stack size :-)

--
Cheers

tobi



Re: Segfaults libpcre in cleanup

2020-05-14 Thread Viktor Dukhovni
On Thu, May 14, 2020 at 02:38:09PM -0400, Wietse Venema wrote:

> Viktor Dukhovni:
> > Alternatively, we could siglongjmp() out of a segfault handler, enabled
> > around PCRE lookups, leaking whatever heap space libpcre may have
> > allocated along the way, and log a more informative message, and thereby
> > perhaps even avoid occasional service throttling in master that may happen
> > if the service is killed by a signal.  Just a thought...
> 
> After SIGSEGV the only sensible thing a process can do is to
> terminate, because the state of memory is undefined.

Yes, that's generally the case.

> Are you assuming that you can know exactly why SIGSEGV is received?
> I don't think so. We have to assume the worst, namely, that a process
> has corrupted it memory.

I am that it might not be too crazy to enable unwinding on SIGSEGV just
around pcre match calls, on the assumption that libpcre only segfauls on
stack exhaustion. 

But then, in any case, we could only log something about overly complex
patterns and _exit(0).  However, "log something" is not async-signal-safe.

So if the stack exhaustion happened not in libpcre itself, but rather in
an internal call in malloc() or similar, then we may not be able to
perform any memory allocation, so the logging before exit would require
considerable care to use static buffers and just async-signal-safe
system calls.  This would be quite a pain.

Therefore, my preference would be to attempt to configure resource
bounds for libpcre that should keep it out of trouble, perhaps at
the cost of failing on some patterns for which enough stack would
in fact have been available.

-- 
Viktor.


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Wietse Venema
Viktor Dukhovni:
> Alternatively, we could siglongjmp() out of a segfault handler, enabled
> around PCRE lookups, leaking whatever heap space libpcre may have
> allocated along the way, and log a more informative message, and thereby
> perhaps even avoid occasional service throttling in master that may happen
> if the service is killed by a signal.  Just a thought...

After SIGSEGV the only sensible thing a process can do is to
terminate, because the state of memory is undefined.

Are you assuming that you can know exactly why SIGSEGV is received?
I don't think so. We have to assume the worst, namely, that a process
has corrupted it memory.

Wietse


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Viktor Dukhovni
On Thu, May 14, 2020 at 01:40:23PM -0400, Wietse Venema wrote:

> > A cursory glance at the PCRE2 docs suggests that we can ask libpcre
> > to enforce more conservative limits before it runs out of stack, and
> > it would presumably then unwind and return a recoverable error.
> 
> That looks like a guessing game to me, because I doubt that libpcre
> really knows how much stack space remains available, as that depends
> on stack canaries and so on.
> 
> Who know what happens when it runs out. Then, it will access an
> invalid page and receive SIGSEGV.

No worse than before, but if we make a conservative guess we can
avoid the segfaults.

Alternatively, we could siglongjmp() out of a segfault handler, enabled
around PCRE lookups, leaking whatever heap space libpcre may have
allocated along the way, and log a more informative message, and thereby
perhaps even avoid occasional service throttling in master that may happen
if the service is killed by a signal.  Just a thought...

-- 
Viktor.


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Wietse Venema
Viktor Dukhovni:
> On Thu, May 14, 2020 at 10:04:44AM -0400, Wietse Venema wrote:
> 
> > Once a program runs out of stack space, that is is an unrecoverable
> > error.
> 
> A cursory glance at the PCRE2 docs suggests that we can ask libpcre
> to enforce more conservative limits before it runs out of stack, and
> it would presumably then unwind and return a recoverable error.

That looks like a guessing game to me, because I doubt that libpcre
really knows how much stack space remains available, as that depends
on stack canaries and so on.

Who know what happens when it runs out. Then, it will access an
invalid page and receive SIGSEGV.

Wietse


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Viktor Dukhovni
On Thu, May 14, 2020 at 10:04:44AM -0400, Wietse Venema wrote:

> Once a program runs out of stack space, that is is an unrecoverable
> error.

A cursory glance at the PCRE2 docs suggests that we can ask libpcre
to enforce more conservative limits before it runs out of stack, and
it would presumably then unwind and return a recoverable error.

-- 
Viktor.


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Wietse Venema
Tobi:
> So the problem is definitely our pattern in combination with a VERY big
> mime part header. We're on introducing limits where we can in our patterns.
> Anyway I think that this should not end in such an ugly error where
> postfix cleanup goes south because of such header/pattern combination.

Once a program runs out of stack space, that is is an unrecoverable
error.

I don't think that is is reasonable for Postfix to set up signal(SIGSEGV)
handlers around calls into an external (PCRE) library. Even if
Postfix did do that, a process would still have to terminate
immediately, because after SIGSEGV it is unsafe to continue execution.

Note that SIGSEGV does not take down Postfix. Postfix is designed
to recover automatically from crashing (non-master) daemon processes.

So your best bet is to rewrite your patterns.

Wietse


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Wietse Venema
Tobi:
> Hi Wietse
> 
> using your suggestion with valgrind I found that the main-stacksize
> seems to be the problem. By using --main-stacksize with different values
> I found that the last working value is 59449345, changing to ...344 let
> postmap segfault:
> 
> > valgrind -s --main-stacksize=59449344 --tool=memcheck postmap -h -q -
> > pcre:/etc/postfix/mime_header_checks.pcre  >
> > ==7988== Stack overflow in thread #1: can't grow stack to 0x1ffb74f000
> 
> We will try allowing more stacksize by /etc/security/limits.conf to
> postfix and re-design our pcre pattern(s) to avoid (where possible) the
> usage of .* without any limits :-)
> 
> Thanks a lot for your appreciated help

Good job!

Wietse


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Tobi
Wietse, Viktor

It seems that this is the problematic part of the pcre pattern

>
/^Content-(Disposition|Type).*name\s*=\s*("(?:[^"]|\\")*|[^();:,\/<>\@\"?=<>\[\]\
]*)

I tested by replacing star by star with explicit limits and found that
limits avoid the segfault. I replaced the last two stars with limits up
to 1 chars

>
/^Content-(Disposition|Type).*name\s*=\s*("(?:[^"]|\\"){0,1}|[^();:,\/<>\@\"?=<>\[\]\
]{0,1})

no segfault but I think the whole header has not been processed as its
longer than 10'000 chars. Tried with higher values than 1 but then
postmap issues a warning that the number is too big.

> postmap: warning: pcre map /etc/postfix/mime_header_checks.pcre, line
21: error in regex at offset 63: number too big in {} quantifier

After solving the "too big number" issue postmap said that the pcre
pattern is too big :-)

> postmap: warning: pcre map /etc/postfix/mime_header_checks.pcre, line
21: error in regex at offset 352: regular expression is too large


So the problem is definitely our pattern in combination with a VERY big
mime part header. We're on introducing limits where we can in our patterns.
Anyway I think that this should not end in such an ugly error where
postfix cleanup goes south because of such header/pattern combination.


--
Cheers

tobi

Am 14.05.20 um 09:13 schrieb Viktor Dukhovni:
> On Thu, May 14, 2020 at 08:53:42AM +0200, Tobi wrote:
> 
>> using your suggestion with valgrind I found that the main-stacksize
>> seems to be the problem. By using --main-stacksize with different values
>> I found that the last working value is 59449345, changing to ...344 let
>> postmap segfault:
> 
> But whatever the stacksize, the PCRE library must not segfault.  We need
> to be sure that it returns an error to the caller, rather than blows up.
> 
>>> ==7988== Stack overflow in thread #1: can't grow stack to 0x1ffb74f000
>>
>> We will try allowing more stacksize by /etc/security/limits.conf to
>> postfix and re-design our pcre pattern(s) to avoid (where possible) the
>> usage of .* without any limits :-)
> 
> Keeping your (ir)regular expressions simpler is always a good idea.
> Especially avoid costly use of multiple '.*' patterns in the same
> expression.  Not a good idea:
> 
> /a.*b.*c.*d.*...z/
> 
> Perhaps there's something we can do with default resource limits.
> From the pcre2pattern(2) manpage:
>  
>Setting match resource limits
> 
>The pcre2_match() function contains a counter that is incremented
>every time it goes round its main loop. The caller of
>pcre2_match() can set a limit on this counter, which  therefore
>limits  the  amount  of  computing resource used for a match. The
>maximum depth of nested backtracking can also be limited; this
>indirectly restricts the amount of heap memory that is used, but
>there is also an explicit memory limit that can be set.
> 
>These facilities are provided to catch runaway matches that are
>provoked by patterns with huge matching trees. A common example
>is a pattern with nested unlimited repeats applied to a long
>string that does not match. When one of these limits is reached,
>pcre2_match() gives an error return. The limits can also be set
>by items at the start of the pattern of the form
> 
>  (*LIMIT_HEAP=d)
>  (*LIMIT_MATCH=d)
>  (*LIMIT_DEPTH=d)
> 
>where  d is any number of decimal digits. However, the value of
>the setting must be less than the value set (or defaulted) by the
>caller of pcre2_match() for it to have any effect. In other
>words, the pattern writer can lower the limits set by the
>programmer, but not raise them. If there is more than one setting
>of one of these limits, the lower value is used.  The heap limit
>is specified in kibibytes (units of 1024 bytes).
> 
>Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION.
>This name is still recognized for backwards compatibility.
> 
>The  heap limit applies only when the pcre2_match() or
>pcre2_dfa_match() interpreters are used for matching. It does not
>apply to JIT. The match limit is used (but in a different way)
>when JIT is being used, or when pcre2_dfa_match() is called, to
>limit computing resource usage by those matching functions. The
>depth limit is ignored by JIT but is relevant  for  DFA matching,
>which  uses  function  recursion for recursions within the
>pattern and for lookaround assertions and atomic groups. In this
>case, the depth limit controls the depth of such recursion.
> 


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Viktor Dukhovni
On Thu, May 14, 2020 at 08:53:42AM +0200, Tobi wrote:

> using your suggestion with valgrind I found that the main-stacksize
> seems to be the problem. By using --main-stacksize with different values
> I found that the last working value is 59449345, changing to ...344 let
> postmap segfault:

But whatever the stacksize, the PCRE library must not segfault.  We need
to be sure that it returns an error to the caller, rather than blows up.

> > ==7988== Stack overflow in thread #1: can't grow stack to 0x1ffb74f000
> 
> We will try allowing more stacksize by /etc/security/limits.conf to
> postfix and re-design our pcre pattern(s) to avoid (where possible) the
> usage of .* without any limits :-)

Keeping your (ir)regular expressions simpler is always a good idea.
Especially avoid costly use of multiple '.*' patterns in the same
expression.  Not a good idea:

/a.*b.*c.*d.*...z/

Perhaps there's something we can do with default resource limits.
>From the pcre2pattern(2) manpage:
 
   Setting match resource limits

   The pcre2_match() function contains a counter that is incremented
   every time it goes round its main loop. The caller of
   pcre2_match() can set a limit on this counter, which  therefore
   limits  the  amount  of  computing resource used for a match. The
   maximum depth of nested backtracking can also be limited; this
   indirectly restricts the amount of heap memory that is used, but
   there is also an explicit memory limit that can be set.

   These facilities are provided to catch runaway matches that are
   provoked by patterns with huge matching trees. A common example
   is a pattern with nested unlimited repeats applied to a long
   string that does not match. When one of these limits is reached,
   pcre2_match() gives an error return. The limits can also be set
   by items at the start of the pattern of the form

 (*LIMIT_HEAP=d)
 (*LIMIT_MATCH=d)
 (*LIMIT_DEPTH=d)

   where  d is any number of decimal digits. However, the value of
   the setting must be less than the value set (or defaulted) by the
   caller of pcre2_match() for it to have any effect. In other
   words, the pattern writer can lower the limits set by the
   programmer, but not raise them. If there is more than one setting
   of one of these limits, the lower value is used.  The heap limit
   is specified in kibibytes (units of 1024 bytes).

   Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION.
   This name is still recognized for backwards compatibility.

   The  heap limit applies only when the pcre2_match() or
   pcre2_dfa_match() interpreters are used for matching. It does not
   apply to JIT. The match limit is used (but in a different way)
   when JIT is being used, or when pcre2_dfa_match() is called, to
   limit computing resource usage by those matching functions. The
   depth limit is ignored by JIT but is relevant  for  DFA matching,
   which  uses  function  recursion for recursions within the
   pattern and for lookaround assertions and atomic groups. In this
   case, the depth limit controls the depth of such recursion.

-- 
Viktor.


Re: Segfaults libpcre in cleanup

2020-05-14 Thread Tobi
Hi Wietse

using your suggestion with valgrind I found that the main-stacksize
seems to be the problem. By using --main-stacksize with different values
I found that the last working value is 59449345, changing to ...344 let
postmap segfault:

> valgrind -s --main-stacksize=59449344 --tool=memcheck postmap -h -q -
> pcre:/etc/postfix/mime_header_checks.pcre 
> ==7988== Stack overflow in thread #1: can't grow stack to 0x1ffb74f000

We will try allowing more stacksize by /etc/security/limits.conf to
postfix and re-design our pcre pattern(s) to avoid (where possible) the
usage of .* without any limits :-)

Thanks a lot for your appreciated help

--
Cheers

tobi

Am 13.05.20 um 16:05 schrieb Wietse Venema:
> Tobi:
>> Hi
>>
>> as usual: thanks to Wietse :-)
>>
>> Adding the info rule to the pcre maps solved more than expected. After
>> adding the info rule postfix cleanup did not segfault anymore and the
>> mail could be accepted --> we have the source now.
>> From what I see that is a massively oversized mime part header. I
>> counted about 500 times the filename= in the mime part header as well as
>> name= contains hundreds of values. All nicely utf8
>>
>> Like
>>
>>> Content-Type: image/png; name="=?UTF-
>>> 8?B?xILChMOCwoLEgsKCw4LChMOEwoLDgsKCxILCgsOCwoLEgsKEw4LCgsSC?=
>>> =?UTF-8?B?woLDgsKCw4TCgsOCwoLEgsKCw4LChMOEwoLDgsKExILCgsOCwoLDhMKC?=
>>> =?UTF-8?B?w4LCgsSCwoLDgsKCxILChMOCwoLEgsKCw4LCgsOEwoLDgsKCxILCgsOC?=
>>
>> and
>>
>>> Content-Disposition: inline; filename*0*=UTF-8''%C4%82%C2%84%C3%82%C2
>>> %82%C4%82%C2%82%C3%82%C2%84%C3%84;
>>> filename*1*=%C2%82%C3%82%C2%82%C4%82%C2%82%C3%82%C2%82%C4%82%C2%84
>>> %C3%82;
>>
>> First of all any idea why cleanup did not segfault with the info rule in
>> place?
> 
> When something accesses wrong memory, what happens next is totally
> dependent on memory layout, execution history, and so on. For example
> the program may not crash (but still produce incorrect behavior).
> 
>> 2nd: could such mime headers be the reason for a pcre pattern to let
>> libpcre segfault?
> 
> Sure, there are sometimes bugfixes for libpcre.
> 
> You can test such things.
> 
> $ postmap -h -q - pcre:pattern-file < data-file
> 
> $ valgrind --tool=memcheck postmap -h -q - pcre:pattern-file < data-file
> 
> where data-file contains the complete crashing message header and
> pattern-file contains the crashing pattern.
> 
> Note that Postfix logs only a limited portion of very large lines.
> You may need to "postcat" the original message if it is available.
> 
> valgrind will report incorrect memory access even when a program
> does not crash. And yes, I do sometimes run Postfix daemons under
> valgrind to find non-crashing bugs.
> 
> I'd be interested if you can reproduce the bug.
> 
>   Wietse
> 


Re: Segfaults libpcre in cleanup

2020-05-13 Thread Wietse Venema
Tobi:
> Hi
> 
> as usual: thanks to Wietse :-)
> 
> Adding the info rule to the pcre maps solved more than expected. After
> adding the info rule postfix cleanup did not segfault anymore and the
> mail could be accepted --> we have the source now.
> From what I see that is a massively oversized mime part header. I
> counted about 500 times the filename= in the mime part header as well as
> name= contains hundreds of values. All nicely utf8
> 
> Like
> 
> > Content-Type: image/png; name="=?UTF-
> > 8?B?xILChMOCwoLEgsKCw4LChMOEwoLDgsKCxILCgsOCwoLEgsKEw4LCgsSC?=
> > =?UTF-8?B?woLDgsKCw4TCgsOCwoLEgsKCw4LChMOEwoLDgsKExILCgsOCwoLDhMKC?=
> > =?UTF-8?B?w4LCgsSCwoLDgsKCxILChMOCwoLEgsKCw4LCgsOEwoLDgsKCxILCgsOC?=
> 
> and
> 
> > Content-Disposition: inline; filename*0*=UTF-8''%C4%82%C2%84%C3%82%C2
> > %82%C4%82%C2%82%C3%82%C2%84%C3%84;
> > filename*1*=%C2%82%C3%82%C2%82%C4%82%C2%82%C3%82%C2%82%C4%82%C2%84
> > %C3%82;
> 
> First of all any idea why cleanup did not segfault with the info rule in
> place?

When something accesses wrong memory, what happens next is totally
dependent on memory layout, execution history, and so on. For example
the program may not crash (but still produce incorrect behavior).

> 2nd: could such mime headers be the reason for a pcre pattern to let
> libpcre segfault?

Sure, there are sometimes bugfixes for libpcre.

You can test such things.

$ postmap -h -q - pcre:pattern-file < data-file

$ valgrind --tool=memcheck postmap -h -q - pcre:pattern-file < data-file

where data-file contains the complete crashing message header and
pattern-file contains the crashing pattern.

Note that Postfix logs only a limited portion of very large lines.
You may need to "postcat" the original message if it is available.

valgrind will report incorrect memory access even when a program
does not crash. And yes, I do sometimes run Postfix daemons under
valgrind to find non-crashing bugs.

I'd be interested if you can reproduce the bug.

Wietse


Re: Segfaults libpcre in cleanup

2020-05-13 Thread Tobi
Hi

as usual: thanks to Wietse :-)

Adding the info rule to the pcre maps solved more than expected. After
adding the info rule postfix cleanup did not segfault anymore and the
mail could be accepted --> we have the source now.
From what I see that is a massively oversized mime part header. I
counted about 500 times the filename= in the mime part header as well as
name= contains hundreds of values. All nicely utf8

Like

> Content-Type: image/png; name="=?UTF-
> 8?B?xILChMOCwoLEgsKCw4LChMOEwoLDgsKCxILCgsOCwoLEgsKEw4LCgsSC?=
> =?UTF-8?B?woLDgsKCw4TCgsOCwoLEgsKCw4LChMOEwoLDgsKExILCgsOCwoLDhMKC?=
> =?UTF-8?B?w4LCgsSCwoLDgsKCxILChMOCwoLEgsKCw4LCgsOEwoLDgsKCxILCgsOC?=

and

> Content-Disposition: inline; filename*0*=UTF-8''%C4%82%C2%84%C3%82%C2
> %82%C4%82%C2%82%C3%82%C2%84%C3%84;
> filename*1*=%C2%82%C3%82%C2%82%C4%82%C2%82%C3%82%C2%82%C4%82%C2%84
> %C3%82;

First of all any idea why cleanup did not segfault with the info rule in
place?
2nd: could such mime headers be the reason for a pcre pattern to let
libpcre segfault?

--

Cheers

tobi

Am 12.05.20 um 15:20 schrieb Wietse Venema:
> Tobi:
>> Hi list
>>
>> we have the very rare problem that cleanup "triggers" segfaults in
>> libpcre. We're currently running postfix 3.5.1 but had the issue before
>> updating (with postfix 3.4.4) as well. OS is a Centos7 with latest updates.
>>
>>> May 12 14:36:14 XXX kernel: cleanup[23927]: segfault at 7ffc5118ef78
>>> ip 7fd07913c98a sp 7ffc5118ef70 error 6 in
>>> libpcre.so.1.2.0[7fd079129000+6]
>>
>> libpcre is of version 8.32
>>
>> Unfortunately we do not have access to the message source. But it's
>> always the same message that triggers the segfault. Every re-send try
>> from sending server ends in that error.
>> We highly assume that one of our pcre rules could be the culprit, but
>> it's hard to find out which one it is. Postfix never complains about a
>> broken pattern or something like that in logs. Also testing with postmap
>> -vv -q works without any error/warning for all our pcre maps. There is
>> no suspicious logging prior to cleanup "crash".
>>
>> Is there a way to narrow down which pcre rule may is problematic, given
>> the fact that we do not have access to message source?
> 
> Use dummy rules:
> 
> /(.+)/info before foo: $1
> /foo/ action for foo
> /(.+)/info before bar
> /bar/ action for bar
> 
>   Wietse
> 



signature.asc
Description: OpenPGP digital signature


Re: Segfaults libpcre in cleanup

2020-05-12 Thread Wietse Venema
Tobi:
> Hi list
> 
> we have the very rare problem that cleanup "triggers" segfaults in
> libpcre. We're currently running postfix 3.5.1 but had the issue before
> updating (with postfix 3.4.4) as well. OS is a Centos7 with latest updates.
> 
> > May 12 14:36:14 XXX kernel: cleanup[23927]: segfault at 7ffc5118ef78
> > ip 7fd07913c98a sp 7ffc5118ef70 error 6 in
> > libpcre.so.1.2.0[7fd079129000+6]
> 
> libpcre is of version 8.32
> 
> Unfortunately we do not have access to the message source. But it's
> always the same message that triggers the segfault. Every re-send try
> from sending server ends in that error.
> We highly assume that one of our pcre rules could be the culprit, but
> it's hard to find out which one it is. Postfix never complains about a
> broken pattern or something like that in logs. Also testing with postmap
> -vv -q works without any error/warning for all our pcre maps. There is
> no suspicious logging prior to cleanup "crash".
> 
> Is there a way to narrow down which pcre rule may is problematic, given
> the fact that we do not have access to message source?

Use dummy rules:

/(.+)/  info before foo: $1
/foo/   action for foo
/(.+)/  info before bar
/bar/   action for bar

Wietse