Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c

2022-08-01 Thread Ruediger Pluem



On 8/1/22 5:05 PM, Ivan Zhakov wrote:
> On 2022/07/29 06:19:03 Ruediger Pluem wrote:
>>
>>
>> On 7/16/22 10:28 AM, Ivan Zhakov wrote:
>>> On Wed, 13 Jul 2022 at 18:06, Yann Ylavic >> > wrote:
>>>
>>> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov >> > wrote:
>>> >
>>> > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic >> > wrote:
>>> > >
>>> > > Let me explain (maybe), I thought that PCRE2's match_data were not
>>> > > only containing the captures vector but also all the frames needed 
>>> for
>>> > > matching not-too-complex regexes (at least the initial ones since 
>>> they
>>> > > will be reallocated on need). So the (supposed) 2KB looked necessary
>>> > > to me for a few stack frames.
>>> > > But the actual 256 bytes only made me look at the PCRE2 code again
>>> > > (better this time) for how that can possibly work, and guess what: 
>>> the
>>> > > initial frames are allocated on the stack ([1]) still in PCRE2, and
>>> > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
>>> > >
>>> > > On Alpine Linux for instance (quite used in docker envs) there is 
>>> only
>>> > > 128KiB stack per thread, so I've always configured
>>> > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
>>> > > vector cache, with custom pcre_malloc/free() to avoid performances
>>> > > overhead and be able to run more complex regexes without exploding,
>>> > > and I can't do that now with PCRE2 and without requiring 20KB of 
>>> stack
>>> > > :/
>>> > > So I created a pull request ([3]).
>>> > >
>>> > The 20KB stack usage in PCRE is interesting, but somewhat unrelated. 
>>> But
>>> > I would suggest a different solution: it would be nice if PCRE 
>>> provided
>>> > a compile time directive to specify the maximum stack allocated 
>>> buffer.
>>> > So then distributions like Alpine Linux would specify more appropriate
>>> > defaults.
>>>
>>> This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
>>> solution indeed.
>>> Say START_FRAMES_SIZE is set to some minimum, a single struct
>>> heapframe with 24 captures, that's (on a 64bit system):
>>>   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
>>> ~= 512 bytes
>>> which would be much appreciated, but would also put much more pressure
>>> on util_regex's private_malloc().
>>> Will ap_regexec() start to use malloc() quite often with the current
>>> implementation then?
>>> Up to the old/default START_FRAMES_SIZE if it was a good default 
>>> afterall?
>>>
>>> But we are not here, because I don't think distros will change the
>>> default pcre START_FRAMES_SIZE for all possible code they run, so most
>>> httpd users (which run on distros I suppose) will continue to consume
>>> 20K of stack memory for every ap_regexec() call.
>>>
>>> The flexible way for pcre2 users (and httpd maintainers) would be to
>>> let them opt-out (at runtime) of stack allocation from pcre2_match(),
>>> which is the patch I proposed.
>>>
>>> I see several problems in this approach:
>>> 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending 
>>> of the platform.
>>> 2. The implementation may introduce unbounded memory usage, if PCRE2 
>>> performs multiple malloc/free during pattern matching. It
>>> seems that currently it doesn't, but we cannot rely on the implementation 
>>> details. Stackbuf code doesn't have this problem,
>>> because it's limited to a fixed-size buffer.
>>> 3. As I said before, the proposed approach effectively increases per thread 
>>> memory usage: 4 or 8 kb for match data stored in
>>> thread state + stack size. Which is something I don't really understand, 
>>> because if we are fine with that, then we can just
>>> increase the thread stack size.
>>>
>>> With the above in mind, I'm -1 for this approach and r1902731.
>>>
>>
>> The latest implementation gives you both, a small allocation on the stack 
>> that does not drive stack usage
>> up too much and that should be sufficient for many regex cases. If this does 
>> not work we use tls to tuck away a pointer to a per
>> thread pool and we only create the pool in case the stack memory isn't 
>> sufficient.
>> If we really increase the memory usage if need to use the pool isn't clear 
>> to me as we use a subpool of the thread pool and thus
>> its allocator. Maybe that allocator has free nodes still around and we reuse 
>> these.
>> With regards to unbound memory usage I don't think that this is an actual 
>> problem, but if PCRE would really massively malloc/free
>> beyond the space we use on the stack just using malloc/free to satisfy these 
>> needs would be problematic as well and I guess this
>> would probably need fixing in PCRE then.
>> Furthermore if this happens we could have a look at using 

Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c

2022-08-01 Thread Ivan Zhakov
On 2022/07/29 06:19:03 Ruediger Pluem wrote:
> 
> 
> On 7/16/22 10:28 AM, Ivan Zhakov wrote:
> > On Wed, 13 Jul 2022 at 18:06, Yann Ylavic  > > wrote:
> > 
> > On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov  > > wrote:
> > >
> > > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic  > > wrote:
> > > >
> > > > Let me explain (maybe), I thought that PCRE2's match_data were not
> > > > only containing the captures vector but also all the frames needed 
> > for
> > > > matching not-too-complex regexes (at least the initial ones since 
> > they
> > > > will be reallocated on need). So the (supposed) 2KB looked necessary
> > > > to me for a few stack frames.
> > > > But the actual 256 bytes only made me look at the PCRE2 code again
> > > > (better this time) for how that can possibly work, and guess what: 
> > the
> > > > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> > > >
> > > > On Alpine Linux for instance (quite used in docker envs) there is 
> > only
> > > > 128KiB stack per thread, so I've always configured
> > > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > > > vector cache, with custom pcre_malloc/free() to avoid performances
> > > > overhead and be able to run more complex regexes without exploding,
> > > > and I can't do that now with PCRE2 and without requiring 20KB of 
> > stack
> > > > :/
> > > > So I created a pull request ([3]).
> > > >
> > > The 20KB stack usage in PCRE is interesting, but somewhat unrelated. 
> > But
> > > I would suggest a different solution: it would be nice if PCRE 
> > provided
> > > a compile time directive to specify the maximum stack allocated 
> > buffer.
> > > So then distributions like Alpine Linux would specify more appropriate
> > > defaults.
> > 
> > This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
> > solution indeed.
> > Say START_FRAMES_SIZE is set to some minimum, a single struct
> > heapframe with 24 captures, that's (on a 64bit system):
> >   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
> > ~= 512 bytes
> > which would be much appreciated, but would also put much more pressure
> > on util_regex's private_malloc().
> > Will ap_regexec() start to use malloc() quite often with the current
> > implementation then?
> > Up to the old/default START_FRAMES_SIZE if it was a good default 
> > afterall?
> > 
> > But we are not here, because I don't think distros will change the
> > default pcre START_FRAMES_SIZE for all possible code they run, so most
> > httpd users (which run on distros I suppose) will continue to consume
> > 20K of stack memory for every ap_regexec() call.
> > 
> > The flexible way for pcre2 users (and httpd maintainers) would be to
> > let them opt-out (at runtime) of stack allocation from pcre2_match(),
> > which is the patch I proposed.
> > 
> > I see several problems in this approach:
> > 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending 
> > of the platform.
> > 2. The implementation may introduce unbounded memory usage, if PCRE2 
> > performs multiple malloc/free during pattern matching. It
> > seems that currently it doesn't, but we cannot rely on the implementation 
> > details. Stackbuf code doesn't have this problem,
> > because it's limited to a fixed-size buffer.
> > 3. As I said before, the proposed approach effectively increases per thread 
> > memory usage: 4 or 8 kb for match data stored in
> > thread state + stack size. Which is something I don't really understand, 
> > because if we are fine with that, then we can just
> > increase the thread stack size.
> > 
> > With the above in mind, I'm -1 for this approach and r1902731.
> > 
> 
> The latest implementation gives you both, a small allocation on the stack 
> that does not drive stack usage
> up too much and that should be sufficient for many regex cases. If this does 
> not work we use tls to tuck away a pointer to a per
> thread pool and we only create the pool in case the stack memory isn't 
> sufficient.
> If we really increase the memory usage if need to use the pool isn't clear to 
> me as we use a subpool of the thread pool and thus
> its allocator. Maybe that allocator has free nodes still around and we reuse 
> these.
> With regards to unbound memory usage I don't think that this is an actual 
> problem, but if PCRE would really massively malloc/free
> beyond the space we use on the stack just using malloc/free to satisfy these 
> needs would be problematic as well and I guess this
> would probably need fixing in PCRE then.
> Furthermore if this happens we could have a look at using apr_bucket_alloc / 
> apr_bucket_free or as this might