Re: pg_parse_json() should not leak token copies on failure

2024-06-04 Thread Michael Paquier
On Tue, Jun 04, 2024 at 10:10:06AM -0700, Jacob Champion wrote:
> On Mon, Jun 3, 2024 at 9:32 PM Kyotaro Horiguchi
>  wrote:
>> I understand that the part enclosed in parentheses refers to the "if
>> (ptr) pfree(ptr)" structure. I believe we rely on the behavior of
>> free(NULL), which safely does nothing. (I couldn't find the related
>> discussion due to a timeout error on the ML search page.)
> 
> For the frontend, right. For the backend, pfree(NULL) causes a crash.
> I think [1] is a related discussion on the list, maybe the one you
> were looking for?

FYI, these choices relate also to 805a397db40b, e890ce7a4feb and
098c703d308f.
--
Michael


signature.asc
Description: PGP signature


Re: pg_parse_json() should not leak token copies on failure

2024-06-04 Thread Jacob Champion
On Mon, Jun 3, 2024 at 9:32 PM Kyotaro Horiguchi
 wrote:
>
> Hi,

Thanks for the review!

> I understand that the part enclosed in parentheses refers to the "if
> (ptr) pfree(ptr)" structure. I believe we rely on the behavior of
> free(NULL), which safely does nothing. (I couldn't find the related
> discussion due to a timeout error on the ML search page.)

For the frontend, right. For the backend, pfree(NULL) causes a crash.
I think [1] is a related discussion on the list, maybe the one you
were looking for?

> Although I don't fully understand the entire parser code, it seems
> that the owner transfer only occurs in JSON_TOKEN_STRING cases. That
> is, the memory pointed by scalar_val would become dangling in
> JSON_TOKEN_NUBMER cases.

It should happen for both cases, I think. I see that the
JSON_SEM_SCALAR_CALL case passes scalar_val into the callback
regardless of whether we're parsing a string or a number, and
parse_scalar() does the same thing over in the recursive
implementation. Have I missed a code path?

> Even if this is not the case, the ownership
> transition apperas quite callenging to follow.

I agree!

> It might be safer or clearer to pstrdup the token in jsonb_in_scalar()
> and avoid NULLifying scalar_val after calling callbacks, or to let
> jsonb_in_sclar() NULLify the pointer.

Maybe. jsonb_in_scalar isn't the only scalar callback implementation,
though. It might be a fairly cruel change for dependent extensions,
since there will be no compiler complaint.

If a compatibility break is acceptable, maybe we should make the API
zero-copy for the recursive case, and just pass the length of the
parsed token? Either way, we'd have to change the object field
callbacks, too, but maybe an API change makes even more sense there,
given how convoluted it is right now.

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/flat/cf26e970-8e92-59f1-247a-aa265235075b%40enterprisedb.com




Re: pg_parse_json() should not leak token copies on failure

2024-06-03 Thread Kyotaro Horiguchi
Hi,

At Fri, 24 May 2024 08:43:01 -0700, Jacob Champion 
 wrote in 
> On Tue, Apr 30, 2024 at 2:29 PM Jacob Champion
>  wrote:
> > Attached is a draft patch to illustrate what I mean, but it's
> > incomplete: it only solves the problem for scalar values.
> 
> (Attached is a v2 of that patch; in solving a frontend leak I should
> probably not introduce a backend segfault.)

I understand that the part enclosed in parentheses refers to the "if
(ptr) pfree(ptr)" structure. I believe we rely on the behavior of
free(NULL), which safely does nothing. (I couldn't find the related
discussion due to a timeout error on the ML search page.)

Although I don't fully understand the entire parser code, it seems
that the owner transfer only occurs in JSON_TOKEN_STRING cases. That
is, the memory pointed by scalar_val would become dangling in
JSON_TOKEN_NUBMER cases. Even if this is not the case, the ownership
transition apperas quite callenging to follow.

It might be safer or clearer to pstrdup the token in jsonb_in_scalar()
and avoid NULLifying scalar_val after calling callbacks, or to let
jsonb_in_sclar() NULLify the pointer.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: pg_parse_json() should not leak token copies on failure

2024-05-24 Thread Jacob Champion
On Tue, Apr 30, 2024 at 2:29 PM Jacob Champion
 wrote:
> Attached is a draft patch to illustrate what I mean, but it's
> incomplete: it only solves the problem for scalar values.

(Attached is a v2 of that patch; in solving a frontend leak I should
probably not introduce a backend segfault.)

--Jacob


v2-0001-WIP-fix-leak-of-scalar-value-on-lex-failure.patch
Description: Binary data


pg_parse_json() should not leak token copies on failure

2024-04-30 Thread Jacob Champion
Hi,

When a client of our JSON parser registers semantic action callbacks,
the parser will allocate copies of the lexed tokens to pass into those
callbacks. The intent is for the callbacks to always take ownership of
the copied memory, and if they don't want it then they can pfree() it.

However, if parsing fails on the token before the callback is invoked,
that allocation is leaked. That's not a problem for the server side,
or for clients that immediately exit on parse failure, but if the JSON
parser gets added to libpq for OAuth, it will become more of a
problem.

(I'm building a fuzzer to flush out these sorts of issues; not only
does it complain loudly about the leaks, but the accumulation of
leaked memory puts a hard limit on how long a fuzzer run can last. :D)

Attached is a draft patch to illustrate what I mean, but it's
incomplete: it only solves the problem for scalar values. We also make
a copy of object field names, which is much harder to fix, because we
make only a single copy and pass that to both the object_field_start
and object_field_end callbacks. Consider:

- If a client only implements object_field_start, it takes ownership
of the field name when we call it. It can free the allocation if it
decides that the field is irrelevant.
- The same is true for clients that only implement object_field_end.
- If a client implements both callbacks, it takes ownership of the
field name when we call object_field_start. But irrelevant field names
can't be freed during that callback, because the same copy is going to
be passed to object_field_end. And object_field_end isn't guaranteed
to be called, because parsing could fail for any number of reasons
between now and then. So what code should be responsible for cleanup?
The parser doesn't know whether the callback already freed the pointer
or kept a reference to it in semstate.

Any thoughts on how we can improve this? I was thinking we could maybe
make two copies of the field name and track ownership individually?

Thanks,
--Jacob


0001-WIP-fix-leak-of-scalar-value-on-lex-failure.patch
Description: Binary data