Re: WIP Incremental JSON Parser

2024-04-24 Thread Andrew Dunstan



On 2024-04-24 We 04:56, Michael Paquier wrote:

On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote:

Yeah, I think this patch invented a new solution to a problem that
we've solved in a different way everywhere else. I think we should
change it to match what we do in general.

As of ba3e6e2bca97, did you notice that test_json_parser_perf
generates two core files because progname is not set, failing an
assertion in the frontend logging code?



No, it didn't for me. Thanks for noticing, I've pushed a fix.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-24 Thread Michael Paquier
On Fri, Apr 19, 2024 at 02:04:40PM -0400, Robert Haas wrote:
> Yeah, I think this patch invented a new solution to a problem that
> we've solved in a different way everywhere else. I think we should
> change it to match what we do in general.

As of ba3e6e2bca97, did you notice that test_json_parser_perf
generates two core files because progname is not set, failing an
assertion in the frontend logging code?
--
Michael




Re: WIP Incremental JSON Parser

2024-04-19 Thread Robert Haas
On Fri, Apr 19, 2024 at 1:35 AM Michael Paquier  wrote:
> Are you sure that relying on File::Temp is a good thing overall?  The
> current temporary file knowledge is encapsulated within Utils.pm, with
> files removed or kept depending on PG_TEST_NOCLEAN.  Wouldn't it just
> be more consistent to rely on the existing facilities instead?
> test_json_parser is the only code path in the whole tree that directly
> uses File::Temp.  The rest of the TAP tests rely on Utils.pm for
> temp file paths.

Yeah, I think this patch invented a new solution to a problem that
we've solved in a different way everywhere else. I think we should
change it to match what we do in general.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-04-18 Thread Michael Paquier
On Thu, Apr 18, 2024 at 06:03:45AM -0400, Andrew Dunstan wrote:
> On 2024-04-18 Th 02:04, Michael Paquier wrote:
>> Why use File::Temp::tempfile here?  Couldn't you just use a
>> file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up
>> once the test finishes?
> 
> That's another possibility, but I think the above is the simplest.

Are you sure that relying on File::Temp is a good thing overall?  The
current temporary file knowledge is encapsulated within Utils.pm, with
files removed or kept depending on PG_TEST_NOCLEAN.  Wouldn't it just
be more consistent to rely on the existing facilities instead?
test_json_parser is the only code path in the whole tree that directly
uses File::Temp.  The rest of the TAP tests rely on Utils.pm for
temp file paths.
--
Michael




Re: WIP Incremental JSON Parser

2024-04-18 Thread Andrew Dunstan


On 2024-04-18 Th 02:04, Michael Paquier wrote:

On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote:

Here's v2 of the cleanup patch 4, which fixes some more typos kindly pointed
out to me by Alexander Lakhin.

I can see that this has been applied as of 42fa4b660143 with some
extra commits.

Anyway, I have noticed another thing in the surroundings that's
annoying.  003 has this logic:
use File::Temp qw(tempfile);
[...]
my ($fh, $fname) = tempfile();

print $fh $stdout,"\n";

close($fh);

This creates a temporary file in /tmp/ that remains around, slowly
bloating the temporary file space on a node while leaving around some
data.



My bad, I should have used the UNLINK option like in the other tests.




  Why use File::Temp::tempfile here?  Couldn't you just use a
file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up
once the test finishes?



That's another possibility, but I think the above is the simplest.




Per [1], escape_json() has no coverage outside its default path.  Is
that intended?



Not particularly. I'll add some stuff to get complete coverage.
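
(For illustration only: a call along these lines would exercise the
non-default escaping paths, assuming escape_json() has the usual
escape_json(StringInfo, const char *) shape used elsewhere in the tree;
it needs the StringInfo machinery from lib/stringinfo.h:)

    StringInfoData buf;

    initStringInfo(&buf);
    /* quotes, backslashes and control characters all take escape paths */
    escape_json(&buf, "quote:\" backslash:\\ tab:\t newline:\n");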




Documenting all these test files with a few comments would be welcome,
as well, with some copyright notices...



ok




 json_file = fopen(testfile, "r");
 fstat(fileno(json_file), &statbuf);
 bytes_left = statbuf.st_size;

No checks on failure of fstat() here?



ok will fix
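
(For illustration, a minimal sketch of the kind of check being asked
for; the error-reporting style and the surrounding declarations are
assumptions, not code from the module:)

    /* requires <stdio.h>, <stdlib.h>, <sys/stat.h> */
    FILE       *json_file;
    struct stat statbuf;
    long        bytes_left;

    json_file = fopen(testfile, "r");
    if (json_file == NULL)
    {
        fprintf(stderr, "could not open \"%s\"\n", testfile);
        exit(1);
    }
    if (fstat(fileno(json_file), &statbuf) != 0)
    {
        fprintf(stderr, "could not stat \"%s\"\n", testfile);
        exit(1);
    }
    bytes_left = statbuf.st_size;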




 json_file = fopen(argv[2], "r");

Second one in test_json_parser_perf.c, with more stuff for fread().



ok will fix
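
(And a matching sketch for the fread() case; the buffer and chunk-size
names below are placeholders:)

    size_t      n_read;

    n_read = fread(buff, 1, chunk_size, json_file);
    if (n_read < (size_t) chunk_size && ferror(json_file))
    {
        fprintf(stderr, "could not read input file\n");
        exit(1);
    }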

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: WIP Incremental JSON Parser

2024-04-18 Thread Michael Paquier
On Wed, Apr 10, 2024 at 07:47:38AM -0400, Andrew Dunstan wrote:
> Here's v2 of the cleanup patch 4, which fixes some more typos kindly pointed
> out to me by Alexander Lakhin.

I can see that this has been applied as of 42fa4b660143 with some
extra commits.

Anyway, I have noticed another thing in the surroundings that's
annoying.  003 has this logic:
use File::Temp qw(tempfile);
[...]
my ($fh, $fname) = tempfile();

print $fh $stdout,"\n";

close($fh);

This creates a temporary file in /tmp/ that remains around, slowly
bloating the temporary file space on a node while leaving around some
data.  Why use File::Temp::tempfile here?  Couldn't you just use a
file in a PostgreSQL::Test::Utils::tempdir() that would be cleaned up
once the test finishes?

Per [1], escape_json() has no coverage outside its default path.  Is
that intended?

Documenting all these test files with a few comments would be welcome,
as well, with some copyright notices...

json_file = fopen(testfile, "r");
fstat(fileno(json_file), &statbuf);
bytes_left = statbuf.st_size;

No checks on failure of fstat() here?

json_file = fopen(argv[2], "r");

Second one in test_json_parser_perf.c, with more stuff for fread().

[1]: 
https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html
--
Michael




Re: WIP Incremental JSON Parser

2024-04-17 Thread Michael Paquier
On Tue, Apr 09, 2024 at 09:26:49AM -0700, Jacob Champion wrote:
> On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan  wrote:
>> I think Michael's point was that if we carry the code we should test we
>> can run it. The other possibility would be just to remove it. I can see
>> arguments for both.
> 
> Hm. If it's not acceptable to carry this (as a worse-is-better smoke
> test) without also running it during tests, then my personal vote
> would be to tear it out and just have people write/contribute targeted
> benchmarks when they end up working on performance. I don't think the
> cost/benefit makes sense at that point.

And you may catch a couple of bugs while at it.  In my experience,
things with custom makefile and/or meson rules tend to rot easily
because everybody forgets about them.  There are a few of them in the
tree that could be ripped out, as well.
--
Michael




Re: WIP Incremental JSON Parser

2024-04-10 Thread Andrew Dunstan


On 2024-04-09 Tu 15:42, Andrew Dunstan wrote:


On 2024-04-09 Tu 07:54, Andrew Dunstan wrote:



On 2024-04-09 Tu 01:23, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:

There is no direct check on test_json_parser_perf.c, either, only a
custom rule in the Makefile without specifying something for meson.
So it looks like you could do a short execution check in a TAP test, at
least.



Not adding a test for that was deliberate - any sane test takes a 
while, and I didn't want to spend that much time on it every time 
someone runs "make check-world" or equivalent. However, adding a test 
to run it with a trivial number of iterations seems reasonable, so 
I'll add that. I'll also add a meson target for the binary.




While reading the code, I've noticed a few things, as well:

+   /* max delicious line length is less than this */
+   char    buff[6001];

Delicious applies to the contents, nothing to do with this code even
if I'd like to think that these lines of code are edible and good.
Using a hardcoded limit for some JSON input sounds like a bad idea to
me even if this is a test module.  Comment that applies to both the
perf and the incremental tools.  You could use a #define'd buffer size
for readability rather than assuming this value in many places.



The comment is a remnant of when I hadn't yet added support for 
incomplete tokens, and so I had to parse the input line by line. I 
agree it can go, and we can use a manifest constant for the buffer size.
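
(Roughly, with a made-up constant name:)

    #define BUFSIZE 6001        /* one size, shared by both test programs */

    char        buff[BUFSIZE];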




+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c
+ * This progam tests incremental parsing of json. The input is fed 
into
+ * full range of incement handling, especially in the lexer, is 
exercised.

+++ b/src/test/modules/test_json_parser/test_json_parser_perf.c
+ *    Performancet est program for both flavors of the JSON parser
+ * This progam tests either the standard (recursive descent) JSON 
parser

+++ b/src/test/modules/test_json_parser/README
+  reads in a file and pases it in very small chunks (60 bytes at a 
time) to


Collection of typos across various files.



Will fix. (The older I get the more typos I seem to make and the 
harder it is to notice them. It's very annoying.)




+ appendStringInfoString(, "1+23 trailing junk");
What's the purpose here?  Perhaps the intention should be documented
in a comment?



The purpose is to ensure that if there is not a trailing '\0' on the 
json chunk the parser will still do the right thing. I'll add a 
comment to that effect.
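
(A sketch of such a comment; the StringInfo variable is elided in the
quote above, so "json" below is only a placeholder name:)

    /*
     * Append garbage after the token so that the chunk does not end
     * with a convenient trailing '\0'; the lexer must still detect
     * the end of the token correctly.
     */
    appendStringInfoString(&json, "1+23 trailing junk");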




At the end, having a way to generate JSON blobs randomly to test this
stuff would be more appealing than what you have currently, with
perhaps a few factors like:
- Array and simple object density.
- Max Level of nesting.
- Limitation to ASCII characters that can be escaped.
- Perhaps more things I cannot think about?



No, I disagree. Maybe we need those things as well, but we do need a 
static test where we can test the output against known results. I 
have no objection to changing the input and output files.


It's worth noting that t/002_inline.pl does generate some input and
tests, e.g., the maximum nesting level, among other errors. Perhaps you
missed that. If you think we need more tests there, adding them would
be extremely simple.




So the current state of things is kind of disappointing, and the size
of the data set added to the tree is not that portable either if you
want to test various scenarios depending on the data set.  It seems to
me that this has been committed too hastily and that this is not ready
for integration, even if that's just a test module.  Tom also has
shared some concerns upthread, as far as I can see.



I have posted a patch already that addresses the issue Tom raised.





Here's a consolidated set of cleanup patches, including the memory 
leak patch and Jacob's shrink-tiny patch.



Here's v2 of the cleanup patch 4, which fixes some more typos kindly
pointed out to me by Alexander Lakhin.



cheers


andrew




--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From 6b12a71b6b43354c0f897ac5feb1e53419a2c15f Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Tue, 9 Apr 2024 15:30:48 -0400
Subject: [PATCH v2] Assorted minor cleanups in the test_json_parser module

Per gripes from Michael Paquier

Discussion: https://postgr.es/m/zhtq6_w1vwohq...@paquier.xyz

Along the way, also clean up a handful of typos in 3311ea86ed and
ea7b4e9a2a, found by Alexander Lakhin.
---
 src/common/jsonapi.c  |  6 ++---
 src/common/parse_manifest.c   |  2 +-
 src/test/modules/test_json_parser/README  | 19 +++--
 .../test_json_parser_incremental.c| 27 ++-
 .../test_json_parser/test_json_parser_perf.c  | 11 
 5 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 9dfbc397c0..12fabcaccf 100644
--- a/src/common/jsonapi.c
+++ 

Re: WIP Incremental JSON Parser

2024-04-09 Thread Jacob Champion
On Mon, Apr 8, 2024 at 10:24 PM Michael Paquier  wrote:
> At the end, having a way to generate JSON blobs randomly to test this
> stuff would be more appealing

For the record, I'm working on an LLVM fuzzer target for the JSON
parser. I think that would be a lot more useful than anything we can
hand-code.

But I want it to cover both the recursive and incremental code paths,
and we'd need to talk about where it would live. libfuzzer is seeded
with a bunch of huge incomprehensible blobs, which is something we're
now trying to avoid checking in. There's also the security aspect of
"what do we do when it finds something", and at that point maybe we
need to look into a service like oss-fuzz.

Thanks,
--Jacob




Re: WIP Incremental JSON Parser

2024-04-09 Thread Jacob Champion
On Tue, Apr 9, 2024 at 7:30 AM Andrew Dunstan  wrote:
> I think Michael's point was that if we carry the code we should test we
> can run it. The other possibility would be just to remove it. I can see
> arguments for both.

Hm. If it's not acceptable to carry this (as a worse-is-better smoke
test) without also running it during tests, then my personal vote
would be to tear it out and just have people write/contribute targeted
benchmarks when they end up working on performance. I don't think the
cost/benefit makes sense at that point.

--Jacob




Re: WIP Incremental JSON Parser

2024-04-09 Thread Andrew Dunstan



On 2024-04-09 Tu 09:45, Jacob Champion wrote:

On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan  wrote:

On 2024-04-09 Tu 01:23, Michael Paquier wrote:
There is no direct check on test_json_parser_perf.c, either, only a
custom rule in the Makefile without specifying something for meson.
So it looks like you could do a short execution check in a TAP test, at
least.

Not adding a test for that was deliberate - any sane test takes a while, and I didn't 
want to spend that much time on it every time someone runs "make check-world" 
or equivalent. However, adding a test to run it with a trivial number of iterations seems 
reasonable, so I'll add that. I'll also add a meson target for the binary.

Okay, but for what purpose? My understanding during review was that
this was a convenience utility for people who were actively hacking on
the code (and I used it for exactly that purpose a few months back, so
I didn't question that any further). Why does the farm need to spend
any time running it at all?



I think Michael's point was that if we carry the code we should test we 
can run it. The other possibility would be just to remove it. I can see 
arguments for both.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-09 Thread Jacob Champion
On Tue, Apr 9, 2024 at 4:54 AM Andrew Dunstan  wrote:
> On 2024-04-09 Tu 01:23, Michael Paquier wrote:
> There is no direct check on test_json_parser_perf.c, either, only a
> custom rule in the Makefile without specifying something for meson.
> So it looks like you could do a short execution check in a TAP test, at
> least.
>
> Not adding a test for that was deliberate - any sane test takes a while, and 
> I didn't want to spend that much time on it every time someone runs "make 
> check-world" or equivalent. However, adding a test to run it with a trivial 
> number of iterations seems reasonable, so I'll add that. I'll also add a 
> meson target for the binary.

Okay, but for what purpose? My understanding during review was that
this was a convenience utility for people who were actively hacking on
the code (and I used it for exactly that purpose a few months back, so
I didn't question that any further). Why does the farm need to spend
any time running it at all?

--Jacob




Re: WIP Incremental JSON Parser

2024-04-09 Thread Andrew Dunstan


On 2024-04-09 Tu 01:23, Michael Paquier wrote:

On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:

There is no direct check on test_json_parser_perf.c, either, only a
custom rule in the Makefile without specifying something for meson.
So it looks like you could do a short execution check in a TAP test, at
least.



Not adding a test for that was deliberate - any sane test takes a while, 
and I didn't want to spend that much time on it every time someone runs 
"make check-world" or equivalent. However, adding a test to run it with 
a trivial number of iterations seems reasonable, so I'll add that. I'll 
also add a meson target for the binary.




While reading the code, I've noticed a few things, as well:

+   /* max delicious line length is less than this */
+   char    buff[6001];

Delicious applies to the contents, nothing to do with this code even
if I'd like to think that these lines of code are edible and good.
Using a hardcoded limit for some JSON input sounds like a bad idea to
me even if this is a test module.  Comment that applies to both the
perf and the incremental tools.  You could use a #define'd buffer size
for readability rather than assuming this value in many places.



The comment is a remnant of when I hadn't yet added support for 
incomplete tokens, and so I had to parse the input line by line. I agree 
it can go, and we can use a manifest constant for the buffer size.





+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c
+ * This progam tests incremental parsing of json. The input is fed into
+ * full range of incement handling, especially in the lexer, is exercised.
+++ b/src/test/modules/test_json_parser/test_json_parser_perf.c
+ *Performancet est program for both flavors of the JSON parser
+ * This progam tests either the standard (recursive descent) JSON parser
+++ b/src/test/modules/test_json_parser/README
+  reads in a file and pases it in very small chunks (60 bytes at a time) to

Collection of typos across various files.



Will fix. (The older I get the more typos I seem to make and the harder 
it is to notice them. It's very annoying.)





+   appendStringInfoString(, "1+23 trailing junk");
What's the purpose here?  Perhaps the intention should be documented
in a comment?



The purpose is to ensure that if there is not a trailing '\0' on the 
json chunk the parser will still do the right thing. I'll add a comment 
to that effect.





At the end, having a way to generate JSON blobs randomly to test this
stuff would be more appealing than what you have currently, with
perhaps a few factors like:
- Array and simple object density.
- Max Level of nesting.
- Limitation to ASCII characters that can be escaped.
- Perhaps more things I cannot think about?



No, I disagree. Maybe we need those things as well, but we do need a 
static test where we can test the output against known results. I have 
no objection to changing the input and output files.


It's worth noting that t/002_inline.pl does generate some input and tests,
e.g., the maximum nesting level, among other errors. Perhaps you missed
that. If you think we need more tests there, adding them would be
extremely simple.





So the current state of things is kind of disappointing, and the size
of the data set added to the tree is not that portable either if you
want to test various scenarios depending on the data set.  It seems to
me that this has been committed too hastily and that this is not ready
for integration, even if that's just a test module.  Tom also has
shared some concerns upthread, as far as I can see.



I have posted a patch already that addresses the issue Tom raised.


cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: WIP Incremental JSON Parser

2024-04-08 Thread Michael Paquier
On Tue, Apr 09, 2024 at 09:48:18AM +0900, Michael Paquier wrote:
> There is no direct check on test_json_parser_perf.c, either, only a
> custom rule in the Makefile without specifying something for meson.
> So it looks like you could do a short execution check in a TAP test, at
> least.

While reading the code, I've noticed a few things, as well:

+   /* max delicious line length is less than this */
+   char    buff[6001];

Delicious applies to the contents, nothing to do with this code even
if I'd like to think that these lines of code are edible and good.
Using a hardcoded limit for some JSON input sounds like a bad idea to
me even if this is a test module.  Comment that applies to both the
perf and the incremental tools.  You could use a #define'd buffer size
for readability rather than assuming this value in many places.

+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c 
+ * This progam tests incremental parsing of json. The input is fed into 
+ * full range of incement handling, especially in the lexer, is exercised. 
+++ b/src/test/modules/test_json_parser/test_json_parser_perf.c
+ *Performancet est program for both flavors of the JSON parser 
+ * This progam tests either the standard (recursive descent) JSON parser 
+++ b/src/test/modules/test_json_parser/README 
+  reads in a file and pases it in very small chunks (60 bytes at a time) to 

Collection of typos across various files.

+   appendStringInfoString(, "1+23 trailing junk"); 
What's the purpose here?  Perhaps the intention should be documented
in a comment?

At the end, having a way to generate JSON blobs randomly to test this
stuff would be more appealing than what you have currently, with
perhaps a few factors like:
- Array and simple object density.
- Max Level of nesting.
- Limitation to ASCII characters that can be escaped.
- Perhaps more things I cannot think about?

So the current state of things is kind of disappointing, and the size
of the data set added to the tree is not that portable either if you
want to test various scenarios depending on the data set.  It seems to
me that this has been committed too hastily and that this is not ready
for integration, even if that's just a test module.  Tom also has
shared some concerns upthread, as far as I can see.
--
Michael




Re: WIP Incremental JSON Parser

2024-04-08 Thread Michael Paquier
On Mon, Apr 08, 2024 at 05:42:35PM -0400, Andrew Dunstan wrote:
> Arguably the fact that it points nowhere is a good thing. But feel free to
> replace it with something else. It doesn't have to be URLs at all. That
> happened simply because it was easy to extract from a very large piece of
> JSON I had lying around, probably from the last time I wrote a JSON parser
> :-)

: "http://www.theatermania.com/broadway/;,

As as matter of fact, this one points to something that seems to be
real.  Most of the blobs had better be trimmed down.

Looking at the code paths in this module, it looks like this could be
even simpler than what Jacob is shaving, checking for:
- Objects start and end, including booleans, ints and strings for the
quotations.
- Nested objects, including arrays made of simple JSON objects.

Some escape patterns are already checked, but most of them are
missing:
https://coverage.postgresql.org/src/test/modules/test_json_parser/test_json_parser_incremental.c.gcov.html

There is no direct check on test_json_parser_perf.c, either, only a
custom rule in the Makefile without specifying something for meson.
So it looks like you could do a short execution check in a TAP test, at
least.
--
Michael




Re: WIP Incremental JSON Parser

2024-04-08 Thread Andrew Dunstan


On 2024-04-08 Mo 09:29, Andrew Dunstan wrote:


On 2024-04-07 Su 20:58, Tom Lane wrote:

Coverity complained that this patch leaks memory:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c: 
212 in load_backup_manifest()

206 bytes_left -= rc;
207 json_parse_manifest_incremental_chunk(
208 inc_state, buffer, rc, bytes_left == 0);
209 }
210
211 close(fd);

 CID 1596259:    (RESOURCE_LEAK)
 Variable "inc_state" going out of scope leaks the storage it 
points to.

212 }
213
214 /* All done. */
215 pfree(buffer);
216 return result;
217 }

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 
488 in parse_manifest_file()

482 bytes_left -= rc;
483 json_parse_manifest_incremental_chunk(
484 inc_state, buffer, rc, bytes_left == 0);
485 }
486
487 close(fd);

 CID 1596257:    (RESOURCE_LEAK)
 Variable "inc_state" going out of scope leaks the storage it 
points to.

488 }
489
490 /* Done with the buffer. */
491 pfree(buffer);
492
493 return result;

It's right about that AFAICS, and not only is the "inc_state" itself
leaked but so is its assorted infrastructure.  Perhaps we don't care
too much about that in the existing applications, but ISTM that
isn't going to be a tenable assumption across the board. Shouldn't
there be a "json_parse_manifest_incremental_shutdown()" or the like
to deallocate all the storage allocated by the parser?




yeah, probably. Will work on it.





Here's a patch. In addition to the leaks Coverity found, there was 
another site in the backend code that should call the shutdown function, 
and a probable memory leak from a logic bug in the incremental JSON
parser code. All these are fixed.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/backend/backup/basebackup_incremental.c b/src/backend/backup/basebackup_incremental.c
index 131598bade..8ae4a62ccc 100644
--- a/src/backend/backup/basebackup_incremental.c
+++ b/src/backend/backup/basebackup_incremental.c
@@ -241,6 +241,9 @@ FinalizeIncrementalManifest(IncrementalBackupInfo *ib)
 	pfree(ib->buf.data);
 	ib->buf.data = NULL;
 
+	/* Done with inc_state, so release that memory too */
+	json_parse_manifest_incremental_shutdown(ib->inc_state);
+
 	/* Switch back to previous memory context. */
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/bin/pg_combinebackup/load_manifest.c b/src/bin/pg_combinebackup/load_manifest.c
index 9c9332cdd5..d857ea0006 100644
--- a/src/bin/pg_combinebackup/load_manifest.c
+++ b/src/bin/pg_combinebackup/load_manifest.c
@@ -208,6 +208,9 @@ load_backup_manifest(char *backup_directory)
   inc_state, buffer, rc, bytes_left == 0);
 		}
 
+		/* Release the incremental state memory */
+		json_parse_manifest_incremental_shutdown(inc_state);
+
 		close(fd);
 	}
 
diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index 90ef4b2037..9594c615c7 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -484,6 +484,9 @@ parse_manifest_file(char *manifest_path)
   inc_state, buffer, rc, bytes_left == 0);
 		}
 
+		/* Release the incremental state memory */
+		json_parse_manifest_incremental_shutdown(inc_state);
+
 		close(fd);
 	}
 
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 44dbb7f7f9..9dfbc397c0 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -488,19 +488,18 @@ freeJsonLexContext(JsonLexContext *lex)
 	if (lex->errormsg)
 		destroyStringInfo(lex->errormsg);
 
+	if (lex->incremental)
+	{
+		pfree(lex->inc_state->partial_token.data);
+		pfree(lex->inc_state);
+		pfree(lex->pstack->prediction);
+		pfree(lex->pstack->fnames);
+		pfree(lex->pstack->fnull);
+		pfree(lex->pstack);
+	}
+
 	if (lex->flags & JSONLEX_FREE_STRUCT)
-	{
-		if (lex->incremental)
-		{
-			pfree(lex->inc_state->partial_token.data);
-			pfree(lex->inc_state);
-			pfree(lex->pstack->prediction);
-			pfree(lex->pstack->fnames);
-			pfree(lex->pstack->fnull);
-			pfree(lex->pstack);
-		}
 		pfree(lex);
-	}
 }
 
 /*
diff --git a/src/common/parse_manifest.c b/src/common/parse_manifest.c
index 970a756ce8..a94e3d6b15 100644
--- a/src/common/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -123,7 +123,6 @@ static bool parse_xlogrecptr(XLogRecPtr *result, char *input);
 
 /*
  * Set up for incremental parsing of the manifest.
- *
  */
 
 JsonManifestParseIncrementalState *
@@ -163,6 +162,18 @@ json_parse_manifest_incremental_init(JsonManifestParseContext *context)
 	return incstate;
 }
 
+/*
+ * Free an incremental state object and its contents.
+ */
+void
+json_parse_manifest_incremental_shutdown(JsonManifestParseIncrementalState *incstate)
+{
+	pfree(incstate->sem.semstate);
+	

Re: WIP Incremental JSON Parser

2024-04-08 Thread Andrew Dunstan



On 2024-04-08 Mo 14:24, Jacob Champion wrote:

Michael pointed out over at [1] that the new tiny.json is pretty
inscrutable given its size, and I have to agree. Attached is a patch
to pare it down 98% or so. I think people wanting to run the
performance comparisons will need to come up with their own gigantic
files.



Let's see if we can do a bit better than that. Maybe a script to 
construct a larger input for the speed test from the smaller file. 
Should be pretty simple.
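
(For what it's worth, a rough sketch of such a generator in C; it
assumes the small file is a single JSON array that fits in memory, and
simply repeats its element list N times inside a new array:)

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
        FILE       *f;
        char       *buf;
        long        len, lo, hi;
        int         n, i;

        if (argc != 3 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;
        n = atoi(argv[2]);
        fseek(f, 0, SEEK_END);
        len = ftell(f);
        rewind(f);
        buf = malloc(len);
        if (buf == NULL || fread(buf, 1, len, f) != (size_t) len)
            return 1;
        /* find the outer array brackets */
        for (lo = 0; lo < len && buf[lo] != '['; lo++)
            ;
        for (hi = len - 1; hi > lo && buf[hi] != ']'; hi--)
            ;
        if (lo >= hi)
            return 1;
        putchar('[');
        for (i = 0; i < n; i++)
        {
            if (i > 0)
                putchar(',');
            fwrite(buf + lo + 1, 1, hi - lo - 1, stdout);
        }
        puts("]");
        return 0;
    }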





Michael, with your "Jacob might be a nefarious cabal of
state-sponsored hackers" hat on, is this observable enough, or do we
need to get it smaller? I was thinking we may want to replace the URLs
with stuff that doesn't link randomly around the Internet. Delicious
in its original form is long gone.



Arguably the fact that it points nowhere is a good thing. But feel free 
to replace it with something else. It doesn't have to be URLs at all. 
That happened simply because it was easy to extract from a very large 
piece of JSON I had lying around, probably from the last time I wrote a 
JSON parser :-)



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-08 Thread Andrew Dunstan



On 2024-04-07 Su 20:58, Tom Lane wrote:

Coverity complained that this patch leaks memory:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c:
 212 in load_backup_manifest()
206 bytes_left -= rc;
207 json_parse_manifest_incremental_chunk(
208 
  inc_state, buffer, rc, bytes_left == 0);
209 }
210
211 close(fd);

 CID 1596259:  (RESOURCE_LEAK)
 Variable "inc_state" going out of scope leaks the storage it points to.

212 }
213
214 /* All done. */
215 pfree(buffer);
216 return result;
217 }

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
 488 in parse_manifest_file()
482 bytes_left -= rc;
483 json_parse_manifest_incremental_chunk(
484 
  inc_state, buffer, rc, bytes_left == 0);
485 }
486
487 close(fd);

 CID 1596257:  (RESOURCE_LEAK)
 Variable "inc_state" going out of scope leaks the storage it points to.

488 }
489
490 /* Done with the buffer. */
491 pfree(buffer);
492
493 return result;

It's right about that AFAICS, and not only is the "inc_state" itself
leaked but so is its assorted infrastructure.  Perhaps we don't care
too much about that in the existing applications, but ISTM that
isn't going to be a tenable assumption across the board.  Shouldn't
there be a "json_parse_manifest_incremental_shutdown()" or the like
to deallocate all the storage allocated by the parser?




yeah, probably. Will work on it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-07 Thread Tom Lane
Coverity complained that this patch leaks memory:

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_combinebackup/load_manifest.c:
 212 in load_backup_manifest()
206 bytes_left -= rc;
207 json_parse_manifest_incremental_chunk(
208 
  inc_state, buffer, rc, bytes_left == 0);
209 }
210 
211 close(fd);
>>> CID 1596259:  (RESOURCE_LEAK)
>>> Variable "inc_state" going out of scope leaks the storage it points to.
212 }
213 
214 /* All done. */
215 pfree(buffer);
216 return result;
217 }

/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c:
 488 in parse_manifest_file()
482 bytes_left -= rc;
483 json_parse_manifest_incremental_chunk(
484 
  inc_state, buffer, rc, bytes_left == 0);
485 }
486 
487 close(fd);
>>> CID 1596257:  (RESOURCE_LEAK)
>>> Variable "inc_state" going out of scope leaks the storage it points to.
488 }
489 
490 /* Done with the buffer. */
491 pfree(buffer);
492 
493 return result;

It's right about that AFAICS, and not only is the "inc_state" itself
leaked but so is its assorted infrastructure.  Perhaps we don't care
too much about that in the existing applications, but ISTM that
isn't going to be a tenable assumption across the board.  Shouldn't
there be a "json_parse_manifest_incremental_shutdown()" or the like
to deallocate all the storage allocated by the parser?

regards, tom lane




Re: WIP Incremental JSON Parser

2024-04-05 Thread Andrew Dunstan



On 2024-04-05 Fr 11:43, Nathan Bossart wrote:

On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote:

On 2024-04-04 Th 15:54, Nathan Bossart wrote:

On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote:

Does the attached patch fix it for you?

It clears up 2 of the 3 warnings for me:

../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
   2018 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
|

Thanks, please try this instead.

LGTM, thanks!



Thanks for checking. Pushed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 10:15:45AM -0400, Andrew Dunstan wrote:
> On 2024-04-04 Th 15:54, Nathan Bossart wrote:
>> On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote:
>> > Does the attached patch fix it for you?
>> It clears up 2 of the 3 warnings for me:
>> 
>> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
>> ../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may 
>> be used uninitialized in this function [-Werror=maybe-uninitialized]
>>   2018 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
>>|
> 
> Thanks, please try this instead.

LGTM, thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WIP Incremental JSON Parser

2024-04-05 Thread Andrew Dunstan


On 2024-04-04 Th 15:54, Nathan Bossart wrote:

On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote:

Does the attached patch fix it for you?

It clears up 2 of the 3 warnings for me:

../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
  2018 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
   |



Thanks, please try this instead.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 5f947dd618..44dbb7f7f9 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -279,6 +279,8 @@ IsValidJsonNumber(const char *str, int len)
 		return false;
 
 	dummy_lex.incremental = false;
+	dummy_lex.inc_state = NULL;
+	dummy_lex.pstack = NULL;
 
 	/*
 	 * json_lex_number expects a leading  '-' to have been eaten already.
@@ -297,6 +299,8 @@ IsValidJsonNumber(const char *str, int len)
 		dummy_lex.input_length = len;
 	}
 
+	dummy_lex.token_start = dummy_lex.input;
+
 	json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error, &total_len);
 
 	return (!numeric_error) && (total_len == dummy_lex.input_length);
@@ -2018,6 +2022,9 @@ json_lex_number(JsonLexContext *lex, char *s,
 	{
 		appendBinaryStringInfo(&lex->inc_state->partial_token,
 			   lex->token_start, s - lex->token_start);
+		if (num_err != NULL)
+			*num_err = error;
+
 		return JSON_INCOMPLETE;
 	}
 	else if (num_err != NULL)


Re: WIP Incremental JSON Parser

2024-04-04 Thread Jelte Fennema-Nio
On Thu, 4 Apr 2024 at 23:52, Jelte Fennema-Nio  wrote:
> Fair enough, I guess I'll change my invocation to include the ninja
> "test" target too:
> ninja -C build test install-quiet && meson test -C build

Actually that doesn't do what I want either because that actually runs
the tests too... And I prefer to do that using meson directly, so I
can add some filters to it. I'll figure something out when I'm more
fresh tomorrow.




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jelte Fennema-Nio
On Thu, 4 Apr 2024 at 23:34, Jacob Champion wrote:
>
> On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio  wrote:
> > Maybe that's something worth addressing too. I expected that
> > install/install-quiet was a strict superset of a plain ninja
> > invocation.
>
> Maybe that's the intent, but I hope not, because I don't see any
> reason for `ninja install` to worry about test-only binaries that
> won't be installed. For the tests, there's a pretty clear rationale
> for the dependency.

Fair enough, I guess I'll change my invocation to include the ninja
"test" target too:
ninja -C build test install-quiet && meson test -C build




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jacob Champion
On Thu, Apr 4, 2024 at 1:42 PM Jelte Fennema-Nio  wrote:
> Maybe that's something worth addressing too. I expected that
> install/install-quiet was a strict superset of a plain ninja
> invocation.

Maybe that's the intent, but I hope not, because I don't see any
reason for `ninja install` to worry about test-only binaries that
won't be installed. For the tests, there's a pretty clear rationale
for the dependency.

--Jacob




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jelte Fennema-Nio
On Thu, 4 Apr 2024 at 20:48, Jacob Champion wrote:
>
> On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion wrote:
> > What's in the `...`? I wouldn't expect to find the test binary in your
> > tmp_install.
>
> Oh, I wonder if this is just a build dependency thing? I typically run
> a bare `ninja` right before testing, because I think most of those
> dependencies are missing for the tests at the moment. (For example, if
> I manually remove the `libpq_pipeline` executable and then try to run
> `meson test` without a rebuild, it fails.)
>
> We can certainly fix that (see attached patch as a first draft) but I
> wonder if there was a reason we decided not to during the Meson port?

Yeah, that patch fixed my problem. (as well as running plain ninja)

To clarify: I did do a rebuild. But it turns out that install-quiet
doesn't build everything... The full command I was having problems
with was:

ninja -C build install-quiet && meson test -C build

Maybe that's something worth addressing too. I expected that
install/install-quiet was a strict superset of a plain ninja
invocation.




Re: WIP Incremental JSON Parser

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 03:31:12PM -0400, Andrew Dunstan wrote:
> Does the attached patch fix it for you?

It clears up 2 of the 3 warnings for me:

../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
../postgresql/src/common/jsonapi.c:2018:30: error: ‘dummy_lex.inc_state’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 2018 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
  |

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WIP Incremental JSON Parser

2024-04-04 Thread Andrew Dunstan


On 2024-04-04 Th 14:06, Nathan Bossart wrote:

Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy:

../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
  2016 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
   |   ~~~^~~
../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may 
be used uninitialized in this function [-Werror=maybe-uninitialized]
  2020 |   lex->token_start, s - lex->token_start);
   | ~~~^
../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
   302 |  return (!numeric_error) && (total_len == dummy_lex.input_length);
   | ~^~~~



Please pile on. I want to get this fixed.


It seems odd that my much later gcc didn't complain.


Does the attached patch fix it for you?


cheers


andrew







--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 5f947dd618..b003656ff8 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -297,6 +297,8 @@ IsValidJsonNumber(const char *str, int len)
 		dummy_lex.input_length = len;
 	}
 
+	dummy_lex.token_start = dummy_lex.input;
+
 	json_lex_number(&dummy_lex, dummy_lex.input, &numeric_error, &total_len);
 
 	return (!numeric_error) && (total_len == dummy_lex.input_length);
@@ -2018,6 +2020,9 @@ json_lex_number(JsonLexContext *lex, char *s,
 	{
 		appendBinaryStringInfo(&lex->inc_state->partial_token,
 			   lex->token_start, s - lex->token_start);
+		if (num_err != NULL)
+			*num_err = false;
+
 		return JSON_INCOMPLETE;
 	}
 	else if (num_err != NULL)


Re: WIP Incremental JSON Parser

2024-04-04 Thread Jacob Champion
On Thu, Apr 4, 2024 at 11:06 AM Nathan Bossart  wrote:
> ../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
> ../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may 
> be used uninitialized in this function [-Werror=maybe-uninitialized]
>  2016 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
>   |   ~~~^~~
> ../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ 
> may be used uninitialized in this function [-Werror=maybe-uninitialized]
>  2020 |   lex->token_start, s - lex->token_start);
>   | ~~~^

Ah, it hasn't figured out that the incremental path is disabled.
Zero-initializing the dummy_lex would be good regardless.

> ../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>   302 |  return (!numeric_error) && (total_len == dummy_lex.input_length);
>   | ~^~~~

For this case, though, I think it'd be good for API consistency if
*numeric_error were always set on return, instead of relying on the
caller to initialize.

--Jacob




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jacob Champion
On Thu, Apr 4, 2024 at 11:12 AM Jacob Champion wrote:
> What's in the `...`? I wouldn't expect to find the test binary in your
> tmp_install.

Oh, I wonder if this is just a build dependency thing? I typically run
a bare `ninja` right before testing, because I think most of those
dependencies are missing for the tests at the moment. (For example, if
I manually remove the `libpq_pipeline` executable and then try to run
`meson test` without a rebuild, it fails.)

We can certainly fix that (see attached patch as a first draft) but I
wonder if there was a reason we decided not to during the Meson port?

--Jacob




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jacob Champion
On Thu, Apr 4, 2024 at 10:17 AM Jelte Fennema-Nio  wrote:
> Command 'test_json_parser_incremental' not found in
> /home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin,
> ...

I can't reproduce this locally...

What's in the `...`? I wouldn't expect to find the test binary in your
tmp_install.

--Jacob




Re: WIP Incremental JSON Parser

2024-04-04 Thread Nathan Bossart
Apologies for piling on, but my compiler (gcc 9.4.0) is unhappy:

../postgresql/src/common/jsonapi.c: In function ‘IsValidJsonNumber’:
../postgresql/src/common/jsonapi.c:2016:30: error: ‘dummy_lex.inc_state’ may be 
used uninitialized in this function [-Werror=maybe-uninitialized]
 2016 |  if (lex->incremental && !lex->inc_state->is_last_chunk &&
  |   ~~~^~~
../postgresql/src/common/jsonapi.c:2020:36: error: ‘dummy_lex.token_start’ may 
be used uninitialized in this function [-Werror=maybe-uninitialized]
 2020 |   lex->token_start, s - lex->token_start);
  | ~~~^
../postgresql/src/common/jsonapi.c:302:26: error: ‘numeric_error’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
  302 |  return (!numeric_error) && (total_len == dummy_lex.input_length);
  | ~^~~~

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WIP Incremental JSON Parser

2024-04-04 Thread Jelte Fennema-Nio
On Thu, 4 Apr 2024 at 18:13, Andrew Dunstan  wrote:
> Argh. You get out of the habit when you're running with meson :-(

I'm having some issues with meson too actually. "meson test -C build"
is now failing with this for me:

Command 'test_json_parser_incremental' not found in
/home/jelte/opensource/postgres/build/tmp_install//home/jelte/.pgenv/pgsql-17beta9/bin,
...




Re: WIP Incremental JSON Parser

2024-04-04 Thread Andrew Dunstan



On 2024-04-04 Th 12:04, Tom Lane wrote:

Oh, more problems: after running check-world in a non-VPATH build,
there are droppings all over my tree:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
   (use "git add ..." to include in what will be committed)
 src/interfaces/ecpg/test/sql/sqljson_jsontable
 src/interfaces/ecpg/test/sql/sqljson_jsontable.c
 src/test/modules/test_json_parser/test_json_parser_incremental
 src/test/modules/test_json_parser/test_json_parser_perf
 src/test/modules/test_json_parser/tmp_check/

nothing added to commit but untracked files present (use "git add" to track)

"make clean" or "make distclean" removes some of that but not all:

$ make -s distclean
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
   (use "git add ..." to include in what will be committed)
 src/test/modules/test_json_parser/test_json_parser_incremental
 src/test/modules/test_json_parser/test_json_parser_perf

nothing added to commit but untracked files present (use "git add" to track)

So we're short several .gitignore entries, and apparently also
shy a couple of make-clean rules.



Argh. You get out of the habit when you're running with meson :-(


There's another issue I'm running down too.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-04 Thread Tom Lane
Oh, more problems: after running check-world in a non-VPATH build,
there are droppings all over my tree:

$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add ..." to include in what will be committed)
src/interfaces/ecpg/test/sql/sqljson_jsontable
src/interfaces/ecpg/test/sql/sqljson_jsontable.c
src/test/modules/test_json_parser/test_json_parser_incremental
src/test/modules/test_json_parser/test_json_parser_perf
src/test/modules/test_json_parser/tmp_check/

nothing added to commit but untracked files present (use "git add" to track)

"make clean" or "make distclean" removes some of that but not all:

$ make -s distclean
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add ..." to include in what will be committed)
src/test/modules/test_json_parser/test_json_parser_incremental
src/test/modules/test_json_parser/test_json_parser_perf

nothing added to commit but untracked files present (use "git add" to track)

So we're short several .gitignore entries, and apparently also
shy a couple of make-clean rules.

regards, tom lane




Re: WIP Incremental JSON Parser

2024-04-04 Thread Andrew Dunstan



On 2024-04-04 Th 11:16, Tom Lane wrote:

I wrote:

I think you just need to follow the standard pattern:

Yeah, the attached is enough to silence it for me.



Thanks, that's what I came up with too (after converting my setup to use 
clang)




(But personally I'd add comments saying that the typedef
appears in thus-and-such header file; see examples in
our tree.)



Done


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-04 Thread Tom Lane
I wrote:
> I think you just need to follow the standard pattern:

Yeah, the attached is enough to silence it for me.
(But personally I'd add comments saying that the typedef
appears in thus-and-such header file; see examples in
our tree.)

regards, tom lane

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 3d1bd37ac26..0bb46b43024 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -79,7 +79,7 @@ typedef enum
  * and the token and value for scalars that need to be preserved
  * across calls.
  */
-typedef struct JsonParserStack
+struct JsonParserStack
 {
 	int			stack_size;
 	char	   *prediction;
@@ -89,18 +89,18 @@ typedef struct JsonParserStack
 	bool	   *fnull;
 	JsonTokenType scalar_tok;
 	char	   *scalar_val;
-} JsonParserStack;
+};
 
 /*
  * struct containing state used when there is a possible partial token at the
  * end of a json chunk when we are doing incremental parsing.
  */
-typedef struct JsonIncrementalState
+struct JsonIncrementalState
 {
 	bool		is_last_chunk;
 	bool		partial_completed;
 	StringInfoData partial_token;
-} JsonIncrementalState;
+};
 
 /*
  * constants and macros used in the nonrecursive parser
diff --git a/src/common/parse_manifest.c b/src/common/parse_manifest.c
index 040c5597df4..f9c026a6369 100644
--- a/src/common/parse_manifest.c
+++ b/src/common/parse_manifest.c
@@ -91,12 +91,12 @@ typedef struct
 	char	   *manifest_checksum;
 } JsonManifestParseState;
 
-typedef struct JsonManifestParseIncrementalState
+struct JsonManifestParseIncrementalState
 {
 	JsonLexContext lex;
 	JsonSemAction sem;
 	pg_cryptohash_ctx *manifest_ctx;
-} JsonManifestParseIncrementalState;
+};
 
 static JsonParseErrorType json_manifest_object_start(void *state);
 static JsonParseErrorType json_manifest_object_end(void *state);


Re: WIP Incremental JSON Parser

2024-04-04 Thread Tom Lane
Andrew Dunstan  writes:
> On 2024-04-04 Th 10:26, Tom Lane wrote:
>> My animals don't like this [1][2]:

> Darn it. Ok, will try to fix.

I think you just need to follow the standard pattern:

typedef struct foo foo;

... foo* is ok to use here ...

struct foo {
   ... fields
};

regards, tom lane




Re: WIP Incremental JSON Parser

2024-04-04 Thread Andrew Dunstan



On 2024-04-04 Th 10:26, Tom Lane wrote:

Andrew Dunstan  writes:

pushed.

My animals don't like this [1][2]:

parse_manifest.c:99:3: error: redefinition of typedef 
'JsonManifestParseIncrementalState' is a C11 feature 
[-Werror,-Wtypedef-redefinition]
} JsonManifestParseIncrementalState;
   ^
../../src/include/common/parse_manifest.h:23:50: note: previous definition is 
here
typedef struct JsonManifestParseIncrementalState 
JsonManifestParseIncrementalState;
  ^
1 error generated.

(and similarly in other places)

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01



Darn it. Ok, will try to fix.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-04 Thread Tom Lane
Andrew Dunstan  writes:
> pushed.

My animals don't like this [1][2]:

parse_manifest.c:99:3: error: redefinition of typedef 
'JsonManifestParseIncrementalState' is a C11 feature 
[-Werror,-Wtypedef-redefinition]
} JsonManifestParseIncrementalState;
  ^
../../src/include/common/parse_manifest.h:23:50: note: previous definition is 
here
typedef struct JsonManifestParseIncrementalState 
JsonManifestParseIncrementalState;
 ^
1 error generated.

(and similarly in other places)

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2024-04-04%2013%3A08%3A02
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2024-04-04%2013%3A07%3A01




Re: WIP Incremental JSON Parser

2024-04-04 Thread Andrew Dunstan



On 2024-04-02 Tu 17:12, Andrew Dunstan wrote:


On 2024-04-02 Tu 15:38, Jacob Champion wrote:
On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan wrote:

Anyway, here are new patches. I've rolled the new semantic test into the
first patch.

Looks good! I've marked RfC.



Thanks! I appreciate all the work you've done on this. I will give it 
one more pass and commit RSN.



pushed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-02 Thread Andrew Dunstan



On 2024-04-02 Tu 15:38, Jacob Champion wrote:

On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan  wrote:

Anyway, here are new patches. I've rolled the new semantic test into the
first patch.

Looks good! I've marked RfC.



Thanks! I appreciate all the work you've done on this. I will give it 
one more pass and commit RSN.





json_lex() is not really a very hot piece of code.

Sure, but I figure if someone is trying to get the performance of the
incremental parser to match the recursive one, so we can eventually
replace it, it might get a little warmer. :)


I don't think this is where the performance penalty lies. Rather, I 
suspect it's the stack operations in the non-recursive parser itself. 
The speed test doesn't involve any partial token processing at all, and 
yet the non-recursive parser is significantly slower in that test.



I think it'd be good for a v1.x of this feature to focus on
simplification of the code, and hopefully consolidate and/or eliminate
some of the duplicated parsing work so that the mental model isn't
quite so big.

I'm not sure how you think that can be done.

I think we'd need to teach the lower levels of the lexer about
incremental parsing too, so that we don't have two separate sources of
truth about what ends a token. Bonus points if we could keep the parse
state across chunks to the extent that we didn't need to restart at
the beginning of the token every time. (Our current tools for this are
kind of poor, like the restartable state machine in PQconnectPoll.
While I'm dreaming, I'd like coroutines.) Now, whether the end result
would be more or less maintainable is left as an exercise...



I tried to disturb the main lexer processing as little as possible. We 
could possibly unify the two paths, but I have a strong suspicion that 
would result in a performance hit (the main part of the lexer doesn't 
copy any characters at all, it just keeps track of pointers into the 
input). And while the code might not undergo lots of change, the routine 
itself is quite performance critical.


Anyway, I think that's all something for another day.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-04-02 Thread Jacob Champion
On Mon, Apr 1, 2024 at 4:53 PM Andrew Dunstan  wrote:
> Anyway, here are new patches. I've rolled the new semantic test into the
> first patch.

Looks good! I've marked RfC.

> json_lex() is not really a very hot piece of code.

Sure, but I figure if someone is trying to get the performance of the
incremental parser to match the recursive one, so we can eventually
replace it, it might get a little warmer. :)

> > I think it'd be good for a v1.x of this feature to focus on
> > simplification of the code, and hopefully consolidate and/or eliminate
> > some of the duplicated parsing work so that the mental model isn't
> > quite so big.
>
> I'm not sure how you think that can be done.

I think we'd need to teach the lower levels of the lexer about
incremental parsing too, so that we don't have two separate sources of
truth about what ends a token. Bonus points if we could keep the parse
state across chunks to the extent that we didn't need to restart at
the beginning of the token every time. (Our current tools for this are
kind of poor, like the restartable state machine in PQconnectPoll.
While I'm dreaming, I'd like coroutines.) Now, whether the end result
would be more or less maintainable is left as an exercise...

Thanks!
--Jacob




Re: WIP Incremental JSON Parser

2024-03-29 Thread Jacob Champion
On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan  wrote:
> Here's a new set of patches, with I think everything except the error case 
> Asserts attended to. There's a change to add semantic processing to the test 
> suite in patch 4, but I'd fold that into patch 1 when committing.

Thanks! 0004 did highlight one last bug for me -- the return value
from semantic callbacks is ignored, so applications can't interrupt
the parsing early:

> +   if (ostart != NULL)
> +   (*ostart) (sem->semstate);
> +   inc_lex_level(lex);

Anything not JSON_SUCCESS should be returned immediately, I think, for
this and the other calls.
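
(That is, roughly, for the object-start case and its siblings:)

    JsonParseErrorType result;

    if (ostart != NULL)
    {
        result = (*ostart) (sem->semstate);
        if (result != JSON_SUCCESS)
            return result;
    }
    inc_lex_level(lex);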

> +   /* make sure we've used all the input */
> +   if (lex->token_terminator - lex->token_start != ptok->len)
> +   return JSON_INVALID_TOKEN;

nit: Debugging this, if anyone ever hits it, is going to be confusing.
The error message is going to claim that a valid token is invalid, as
opposed to saying that the parser has a bug. Short of introducing
something like JSON_INTERNAL_ERROR, maybe an Assert() should be added
at least?
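
(For example, keeping the defensive return while making the expectation
explicit:)

    /* make sure we've used all the input */
    if (lex->token_terminator - lex->token_start != ptok->len)
    {
        /* unreachable unless the parser itself has a bug */
        Assert(false);
        return JSON_INVALID_TOKEN;
    }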

> +   case JSON_NESTING_TOO_DEEP:
> +   return(_("Json nested too deep, maximum permitted depth is 
> 6400"));

nit: other error messages use all-caps, "JSON"

> +   char    chunk_start[100],
> +   chunk_end[100];
> +
> +   snprintf(chunk_start, 100, "%s", ib->buf.data);
> +   snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99)));
> +   elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);

Is this NOTICE intended to be part of the final commit?

> +   inc_state = json_parse_manifest_incremental_init();
> +
> +   buffer = pg_malloc(chunk_size + 64);

These `+ 64`s (there are two of them) seem to be left over from
debugging. In v7 they were just `+ 1`.

--

>From this point onward, I think all of my feedback is around
maintenance characteristics, which is firmly in YMMV territory, and I
don't intend to hold up a v1 of the feature for it since I'm not the
maintainer. :D

The complexity around the checksum handling and "final chunk"-ing is
unfortunate, but not a fault of this patch -- just a weird consequence
of the layering violation in the format, where the checksum is inside
the object being checksummed. I don't like it, but I don't think
there's much to be done about it.

By far the trickiest part of the implementation is the partial token
logic, because of all the new ways there are to handle different types
of failures. I think any changes to the incremental handling in
json_lex() are going to need intense scrutiny from someone who already
has the mental model loaded up. I'm going snowblind on the patch and
I'm no longer the right person to review how hard it is to get up to
speed with the architecture, but I'd say it's not easy.

For something as security-critical as a JSON parser, I'd usually want
to counteract that by sinking the observable behaviors in concrete and
getting both the effective test coverage *and* the code coverage as
close to 100% as we can stand. (By that, I mean that purposely
introducing observable bugs into the parser should result in test
failures. We're not there yet when it comes to the semantic callbacks,
at least.) But I don't think the current JSON parser is held to that
standard currently, so it seems strange for me to ask for that here.

I think it'd be good for a v1.x of this feature to focus on
simplification of the code, and hopefully consolidate and/or eliminate
some of the duplicated parsing work so that the mental model isn't
quite so big.

Thanks!
--Jacob




Re: WIP Incremental JSON Parser

2024-03-29 Thread Andrew Dunstan



On 2024-03-26 Tu 17:52, Jacob Champion wrote:

On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion
 wrote:

- add Assert calls in impossible error cases [2]

To expand on this one, I think these parts of the code (marked with
`<- here`) are impossible to reach:


switch (top)
{
    case JSON_TOKEN_STRING:
        if (next_prediction(pstack) == JSON_TOKEN_COLON)
            ctx = JSON_PARSE_STRING;
        else
            ctx = JSON_PARSE_VALUE;        <- here
        break;
    case JSON_TOKEN_NUMBER:                <- here
    case JSON_TOKEN_TRUE:                  <- here
    case JSON_TOKEN_FALSE:                 <- here
    case JSON_TOKEN_NULL:                  <- here
    case JSON_TOKEN_ARRAY_START:           <- here
    case JSON_TOKEN_OBJECT_START:          <- here
        ctx = JSON_PARSE_VALUE;
        break;
    case JSON_TOKEN_ARRAY_END:             <- here
        ctx = JSON_PARSE_ARRAY_NEXT;
        break;
    case JSON_TOKEN_OBJECT_END:            <- here
        ctx = JSON_PARSE_OBJECT_NEXT;
        break;
    case JSON_TOKEN_COMMA:                 <- here
        if (next_prediction(pstack) == JSON_TOKEN_STRING)
            ctx = JSON_PARSE_OBJECT_NEXT;
        else
            ctx = JSON_PARSE_ARRAY_NEXT;
        break;

Since none of these cases are non-terminals, the only way to get to
this part of the code is if (top != tok). But inspecting the
productions and transitions that can put these tokens on the stack, it
looks like the only way for them to be at the top of the stack to
begin with is if (tok == top). (Otherwise, we would have chosen a
different production, or else errored out on a non-terminal.)

This case is possible...


    case JSON_TOKEN_STRING:
        if (next_prediction(pstack) == JSON_TOKEN_COLON)
            ctx = JSON_PARSE_STRING;

...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the
corresponding else case is not, because if we're in the middle of a
_KEY_PAIRS production, the next_prediction() _must_ be
JSON_TOKEN_COLON.

Do you agree, or am I missing a way to get there?




One good way of testing would be to add the Asserts, build with 
-DFORCE_JSON_PSTACK, and run the standard regression suite, which has a 
fairly comprehensive set of JSON errors. I'll play with that.
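
If the analysis holds, the believed-unreachable branches could carry
something like this (a sketch, not a finished patch):

    case JSON_TOKEN_ARRAY_END:
        /* per the grammar analysis above, this should be unreachable */
        Assert(false);
        ctx = JSON_PARSE_ARRAY_NEXT;
        break;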


cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-03-26 Thread Jacob Champion
On Mon, Mar 25, 2024 at 3:14 PM Jacob Champion
 wrote:
> - add Assert calls in impossible error cases [2]

To expand on this one, I think these parts of the code (marked with
`<- here`) are impossible to reach:

> switch (top)
> {
>     case JSON_TOKEN_STRING:
>         if (next_prediction(pstack) == JSON_TOKEN_COLON)
>             ctx = JSON_PARSE_STRING;
>         else
>             ctx = JSON_PARSE_VALUE;        <- here
>         break;
>     case JSON_TOKEN_NUMBER:                <- here
>     case JSON_TOKEN_TRUE:                  <- here
>     case JSON_TOKEN_FALSE:                 <- here
>     case JSON_TOKEN_NULL:                  <- here
>     case JSON_TOKEN_ARRAY_START:           <- here
>     case JSON_TOKEN_OBJECT_START:          <- here
>         ctx = JSON_PARSE_VALUE;
>         break;
>     case JSON_TOKEN_ARRAY_END:             <- here
>         ctx = JSON_PARSE_ARRAY_NEXT;
>         break;
>     case JSON_TOKEN_OBJECT_END:            <- here
>         ctx = JSON_PARSE_OBJECT_NEXT;
>         break;
>     case JSON_TOKEN_COMMA:                 <- here
>         if (next_prediction(pstack) == JSON_TOKEN_STRING)
>             ctx = JSON_PARSE_OBJECT_NEXT;
>         else
>             ctx = JSON_PARSE_ARRAY_NEXT;
>         break;

Since none of these cases are non-terminals, the only way to get to
this part of the code is if (top != tok). But inspecting the
productions and transitions that can put these tokens on the stack, it
looks like the only way for them to be at the top of the stack to
begin with is if (tok == top). (Otherwise, we would have chosen a
different production, or else errored out on a non-terminal.)

This case is possible...

>     case JSON_TOKEN_STRING:
>         if (next_prediction(pstack) == JSON_TOKEN_COLON)
>             ctx = JSON_PARSE_STRING;

...if we're in the middle of JSON_PROD_[MORE_]KEY_PAIRS. But the
corresponding else case is not, because if we're in the middle of a
_KEY_PAIRS production, the next_prediction() _must_ be
JSON_TOKEN_COLON.

Do you agree, or am I missing a way to get there?

Thanks,
--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:24 PM Andrew Dunstan  wrote:
> OK, so we invent a new error code and have the parser  return that if the 
> stack depth gets too big?

Yeah, that seems reasonable. I'd potentially be able to build on that
via OAuth for next cycle, too, since that client needs to limit its
memory usage.
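
Roughly this shape, I'd imagine (a sketch; the macro name and limit are
placeholders, not settled):

    #define JSON_MAX_NEST_LEVEL 6400

    if (lex->lex_level >= JSON_MAX_NEST_LEVEL)
        return JSON_NESTING_TOO_DEEP;   /* the proposed new error code */
    lex->lex_level++;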

--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Andrew Dunstan
On Mon, Mar 25, 2024 at 7:12 PM Jacob Champion <
jacob.champ...@enterprisedb.com> wrote:

> On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan 
> wrote:
> > Well, what's the alternative? The current parser doesn't check stack
> depth in frontend code. Presumably it too will eventually just run out of
> memory, possibly rather sooner as the stack frames could  be more expensive
> than the incremental parser stack extensions.
>
> Stack size should be pretty limited, at least on the platforms I'm
> familiar with. So yeah, the recursive descent will segfault pretty
> quickly, but it won't repalloc() an unbounded amount of heap space.
> The alternative would just be to go back to a hardcoded limit in the
> short term, I think.
>
>
>
OK, so we invent a new error code and have the parser  return that if the
stack depth gets too big?

cheers

andrew


Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:12 PM Jacob Champion
 wrote:
> Stack size should be pretty limited, at least on the platforms I'm
> familiar with. So yeah, the recursive descent will segfault pretty
> quickly, but it won't repalloc() an unbounded amount of heap space.
> The alternative would just be to go back to a hardcoded limit in the
> short term, I think.

And I should mention that there are other ways to consume a bunch of
memory, but I think they're bounded by the size of the JSON file.
Looks like the repalloc()s amplify the JSON size by a factor of ~20
(JS_MAX_PROD_LEN + sizeof(char*) + sizeof(bool)). That may or may not
be enough to be concerned about in the end, since I think it's still
linear, but I wanted to make sure it was known.

--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Mon, Mar 25, 2024 at 4:02 PM Andrew Dunstan  wrote:
> Well, what's the alternative? The current parser doesn't check stack depth in 
> frontend code. Presumably it too will eventually just run out of memory, 
> possibly rather sooner as the stack frames could  be more expensive than the 
> incremental parser stack extensions.

Stack size should be pretty limited, at least on the platforms I'm
familiar with. So yeah, the recursive descent will segfault pretty
quickly, but it won't repalloc() an unbounded amount of heap space.
The alternative would just be to go back to a hardcoded limit in the
short term, I think.

--Jacob




Re: WIP Incremental JSON Parser

2024-03-25 Thread Andrew Dunstan
On Mon, Mar 25, 2024 at 6:15 PM Jacob Champion <
jacob.champ...@enterprisedb.com> wrote:

> On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan 
> wrote:
> > Thanks, included that and attended to the other issues we discussed. I
> think this is pretty close now.
>
> Okay, looking over the thread, there are the following open items:
> - extend the incremental test in order to exercise the semantic callbacks
> [1]
>


Yeah, I'm on a super long plane trip later this week, so I might get it
done then :-)


> - add Assert calls in impossible error cases [2]
>

ok, will do


> - error out if the non-incremental lex doesn't consume the entire token [2]
>

ok, will do


> - double-check that out of memory is an appropriate failure mode for
> the frontend [3]
>


Well, what's the alternative? The current parser doesn't check stack depth
in frontend code. Presumably it too will eventually just run out of memory,
possibly rather sooner as the stack frames could  be more expensive than
the incremental parser stack extensions.



>
> Just as a general style nit:
>
> > +   if (lex->incremental)
> > +   {
> > +   lex->input = lex->token_terminator = lex->line_start = json;
> > +   lex->input_length = len;
> > +   lex->inc_state->is_last_chunk = is_last;
> > +   }
> > +   else
> > +   return JSON_INVALID_LEXER_TYPE;
>
> I think flipping this around would probably make it more readable;
> something like:
>
> if (!lex->incremental)
> return JSON_INVALID_LEXER_TYPE;
>
> lex->input = ...
>
>
>
Noted. will do, Thanks.

cheers

andrew



> [1]
> https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
> [3]
> https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com
>


Re: WIP Incremental JSON Parser

2024-03-25 Thread Jacob Champion
On Wed, Mar 20, 2024 at 11:56 PM Andrew Dunstan  wrote:
> Thanks, included that and attended to the other issues we discussed. I think 
> this is pretty close now.

Okay, looking over the thread, there are the following open items:
- extend the incremental test in order to exercise the semantic callbacks [1]
- add Assert calls in impossible error cases [2]
- error out if the non-incremental lex doesn't consume the entire token [2]
- double-check that out of memory is an appropriate failure mode for
the frontend [3]

Just as a general style nit:

> +   if (lex->incremental)
> +   {
> +   lex->input = lex->token_terminator = lex->line_start = json;
> +   lex->input_length = len;
> +   lex->inc_state->is_last_chunk = is_last;
> +   }
> +   else
> +   return JSON_INVALID_LEXER_TYPE;

I think flipping this around would probably make it more readable;
something like:

if (!lex->incremental)
return JSON_INVALID_LEXER_TYPE;

lex->input = ...

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/CAOYmi%2BnHV55Uhz%2Bo-HKq0GNiWn2L5gMcuyRQEz_fqpGY%3DpFxKA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAD5tBcLi2ffZkktV2qrsKSBykE-N8CiYgrfbv0vZ-F7%3DxLFeqw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CAOYmi%2BnY%3DrF6dJCzaOuA3d-3FbwXCcecOs_S1NutexFA3dRXAw%40mail.gmail.com




Re: WIP Incremental JSON Parser

2024-03-20 Thread Jacob Champion
This new return path...

> +   if (!tok_done)
> +   {
> +   if (lex->inc_state->is_last_chunk)
> +   return JSON_INVALID_TOKEN;

...also needs to set the token pointers. See one approach in the
attached diff, which additionally asserts that we've consumed the
entire chunk in this case, along with a handful of new tests.

--Jacob
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 4285b6557c..5babdd0e63 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -1378,10 +1378,16 @@ json_lex(JsonLexContext *lex)
 
if (!tok_done)
{
-   if (lex->inc_state->is_last_chunk)
-   return JSON_INVALID_TOKEN;
-   else
+   /* We should have consumed the whole chunk in this case. */
+   Assert(added == lex->input_length);
+
+   if (!lex->inc_state->is_last_chunk)
return JSON_INCOMPLETE;
+
+   /* json_errdetail() needs access to the accumulated token. */
+   lex->token_start = ptok->data;
+   lex->token_terminator = ptok->data + ptok->len;
+   return JSON_INVALID_TOKEN;
}
 
lex->input += added;
diff --git a/src/test/modules/test_json_parser/t/002_inline.pl 
b/src/test/modules/test_json_parser/t/002_inline.pl
index 7223287000..03e5d30187 100644
--- a/src/test/modules/test_json_parser/t/002_inline.pl
+++ b/src/test/modules/test_json_parser/t/002_inline.pl
@@ -42,7 +42,8 @@ sub test
 
 test("number", "12345");
 test("string", '"hello"');
-test("boolean", "false");
+test("false", "false");
+test("true", "true");
 test("null", "null");
 test("empty object", "{}");
 test("empty array", "[]");
@@ -53,6 +54,9 @@ test("array with string", '["hello"]');
 test("array with boolean", '[false]');
 test("single pair", '{"key": "value"}');
 test("heavily nested array", "[" x 3200 . "]" x 3200);
+test("serial escapes", '""');
+test("interrupted escapes", '"\\"\\""');
+test("whitespace", ' "" ');
 
 test("unclosed empty object", "{", error => qr/input string ended 
unexpectedly/);
 test("bad key", "{{", error => qr/Expected string or "}", but found "\{"/);
@@ -75,5 +79,6 @@ test("smashed top-level scalar", "12zz", error => qr/Token 
"12zz" is invalid/);
 test("smashed scalar in array", "[12zz]", error => qr/Token "12zz" is 
invalid/);
 test("unknown escape sequence", '"hello\vworld"', error => qr/Escape sequence 
"\\v" is invalid/);
 test("unescaped control", "\"hello\tworld\"", error => qr/Character with value 
0x09 must be escaped/);
+test("incorrect escape count", '"\\"', error => qr/Token 
""\\"" is invalid/);
 
 done_testing();


Re: WIP Incremental JSON Parser

2024-03-20 Thread Jacob Champion
On Tue, Mar 19, 2024 at 3:07 PM Andrew Dunstan  wrote:
> On Mon, Mar 18, 2024 at 3:35 PM Jacob Champion 
>  wrote:
>> With the incremental parser, I think prev_token_terminator is not
>> likely to be safe to use except in very specific circumstances, since
>> it could be pointing into a stale chunk. Some documentation around how
>> to use that safely in a semantic action would be good.
>
> Quite right. It's not safe. Should we ensure it's set to something like NULL 
> or -1?

Nulling it out seems reasonable.

> Also, where do you think we should put a warning about it?

I was thinking in the doc comment for JsonLexContext.

> It also removes the frontend exits I had. In the case of stack depth, we 
> follow the example of the RD parser and only check stack depth for backend 
> code. In the case of the check that the lexer is set up for incremental 
> parsing, the exit is replaced by an Assert. That means your test for an 
> over-nested array doesn't work any more, so I have commented it out.

Hm, okay. We really ought to fix the recursive parser, but that's for
a separate thread. (Probably OAuth.) The ideal behavior IMO would be
for the caller to configure a maximum depth in the JsonLexContext.

Note that the repalloc will eventually still exit() if the pstack gets
too big; is that a concern? Alternatively, could unbounded heap growth
be a problem for a superuser? I guess the scalars themselves aren't
capped for length...

On Wed, Mar 20, 2024 at 12:19 AM Andrew Dunstan  wrote:
> On second thoughts, I think it might be better if we invent a new error 
> return code for a lexer mode mismatch.

+1

--Jacob




Re: WIP Incremental JSON Parser

2024-03-20 Thread Andrew Dunstan
On Tue, Mar 19, 2024 at 6:07 PM Andrew Dunstan  wrote:

>
>
>
>
> It also removes the frontend exits I had. In the case of stack depth, we
> follow the example of the RD parser and only check stack depth for backend
> code. In the case of the check that the lexer is set up for incremental
> parsing, the exit is replaced by an Assert.
>
>
On second thoughts, I think it might be better if we invent a new error
return code for a lexer mode mismatch.

cheers

andrew


Re: WIP Incremental JSON Parser

2024-03-18 Thread Jacob Champion
On Mon, Mar 18, 2024 at 3:32 AM Andrew Dunstan  wrote:
> Not very easily. But I think and hope I've fixed the issue you've identified 
> above about returning before lex->token_start is properly set.
>
>  Attached is a new set of patches that does that and is updated for the 
> json_errdetaiil() changes.

Thanks!

>++   * Normally token_start would be ptok->data, but it could be 
> later,
>++   * see json_lex_string's handling of invalid escapes.
> +   */
>-+  lex->token_start = ptok->data;
>++  lex->token_start = dummy_lex.token_start;
> +  lex->token_terminator = ptok->data + ptok->len;

By the same token (ha), the lex->token_terminator needs to be updated
from dummy_lex for some error paths. (IIUC, on success, the
token_terminator should always point to the end of the buffer. If it's
not possible to combine the two code paths, maybe it'd be good to
check that and assert/error out if we've incorrectly pulled additional
data into the partial token.)

With the incremental parser, I think prev_token_terminator is not
likely to be safe to use except in very specific circumstances, since
it could be pointing into a stale chunk. Some documentation around how
to use that safely in a semantic action would be good.

It looks like some of the newly added error handling paths cannot be
hit, because the production stack makes it logically impossible to get
there. (For example, if it takes a successfully lexed comma to
transition into JSON_PROD_MORE_ARRAY_ELEMENTS to begin with, then when
we pull that production's JSON_TOKEN_COMMA off the stack, we can't
somehow fail to match that same comma.) Assuming I haven't missed a
different way to get into that situation, could the "impossible" cases
have assert calls added?

I've attached two diffs. One is the group of tests I've been using
locally (called 002_inline.pl; I replaced the existing inline tests
with it), and the other is a set of potential fixes to get those tests
green.

Thanks,
--Jacob
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 8273b35b01..f7608f84b9 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -928,11 +928,12 @@ pg_parse_json_incremental(JsonLexContext *lex,
case JSON_TOKEN_END:
ctx = JSON_PARSE_END;
break;
-
-   /*
-    * mostly we don't need to worry about non-terminals here,
-    * but there are a few cases where we do.
-    */
+   case JSON_NT_MORE_ARRAY_ELEMENTS:
+   ctx = JSON_PARSE_ARRAY_NEXT;
+   break;
+   case JSON_NT_ARRAY_ELEMENTS:
+   ctx = JSON_PARSE_ARRAY_START;
+   break;
case JSON_NT_MORE_KEY_PAIRS:
ctx = JSON_PARSE_OBJECT_NEXT;
break;
@@ -1412,7 +1413,7 @@ json_lex(JsonLexContext *lex)
 * see json_lex_string's handling of invalid escapes.
 */
lex->token_start = dummy_lex.token_start;
-   lex->token_terminator = ptok->data + ptok->len;
+   lex->token_terminator = dummy_lex.token_terminator;
if (partial_result == JSON_SUCCESS)
lex->inc_state->partial_completed = true;
return partial_result;
commit 3d615593d6d79178a8b8a208172414806d40cc03
Author: Jacob Champion 
Date:   Mon Mar 11 06:13:47 2024 -0700

WIP: test many different inline cases

diff --git a/src/test/modules/test_json_parser/meson.build 
b/src/test/modules/test_json_parser/meson.build
index a5bedce94e..2563075c34 100644
--- a/src/test/modules/test_json_parser/meson.build
+++ b/src/test/modules/test_json_parser/meson.build
@@ -45,6 +45,7 @@ tests += {
   'tap': {
 'tests': [
   't/001_test_json_parser_incremental.pl',
+  't/002_inline.pl',
 ],
   },
 }
diff --git 
a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl 
b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
index 477198b843..24f665c021 100644
--- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
+++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
@@ -12,8 +12,6 @@ my $test_file = "$FindBin::RealBin/../tiny.json";
 
 my $exe = "test_json_parser_incremental";
 
-my @inlinetests = ("false", "12345", '"a"', "[1,2]");
-
 for (my $size = 64; $size > 0; $size--)
 {
my ($stdout, $stderr) = run_command( [$exe, "-c", $size, $test_file] );
@@ -22,21 +20,4 @@ for (my $size = 64; $size > 0; $size--)
is($stderr, "", 

Re: WIP Incremental JSON Parser

2024-03-14 Thread Jacob Champion
I've been poking at the partial token logic. The json_errdetail() bug
mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk
size) seems to be due to the disconnection between the "main" lex
instance and the dummy_lex that's created from it. The dummy_lex
contains all the information about the failed token, which is
discarded upon an error return:

> partial_result = json_lex(&dummy_lex);
> if (partial_result != JSON_SUCCESS)
> return partial_result;

In these situations, there's an additional logical error:
lex->token_start is pointing to a spot in the string after
lex->token_terminator, which breaks an invariant that will mess up
later pointer math. Nothing appears to be setting lex->token_start to
point into the partial token buffer until _after_ the partial token is
successfully lexed, which doesn't seem right -- in addition to the
pointer math problems, if a previous chunk was freed (or on a stale
stack frame), lex->token_start will still be pointing off into space.
Similarly, wherever we set token_terminator, we need to know that
token_start is pointing into the same buffer.
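
One possible shape for a fix (a sketch): copy the failed token's pointers
back from the dummy lexer before bailing out, so that json_errdetail()
has something valid to look at:

    partial_result = json_lex(&dummy_lex);
    if (partial_result != JSON_SUCCESS)
    {
        /* keep the error pointers valid for json_errdetail() */
        lex->token_start = dummy_lex.token_start;
        lex->token_terminator = dummy_lex.token_terminator;
        return partial_result;
    }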

Determining the end of a token is now done in two separate places
between the partial- and full-lexer code paths, which is giving me a
little heartburn. I'm concerned that those could drift apart, and if
the two disagree on where to end a token, we could lose data into the
partial token buffer in a way that would be really hard to debug. Is
there a way to combine them?

Thanks,
--Jacob




Re: WIP Incremental JSON Parser

2024-03-11 Thread Jacob Champion
On Sun, Mar 10, 2024 at 11:43 PM Andrew Dunstan  wrote:
> I haven't managed to reproduce this. But I'm including some tests for it.

If I remove the newline from the end of the new tests:

> @@ -25,7 +25,7 @@ for (my $size = 64; $size > 0; $size--)
>  foreach my $test_string (@inlinetests)
>  {
> my ($fh, $fname) = tempfile();
> -   print $fh "$test_string\n";
> +   print $fh "$test_string";
> close($fh);
>
> foreach my $size (1..6, 10, 20, 30)

then I can reproduce the same result as my local tests. That extra
whitespace appears to help the partial token logic out somehow.

--Jacob




Re: WIP Incremental JSON Parser

2024-03-07 Thread Andrew Dunstan



On 2024-03-07 Th 10:28, Jacob Champion wrote:

Some more observations as I make my way through the patch:

In src/common/jsonapi.c,


+#define JSON_NUM_NONTERMINALS 6

Should this be 5 now?



Yep.





+   res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
+   chunk, size, is_last);
+
+   expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;
+
+   if (res != expected)
+   json_manifest_parse_failure(context, "parsing failed");

This leads to error messages like

 pg_verifybackup: error: could not parse backup manifest: parsing failed

which I would imagine is going to lead to confused support requests in
the event that someone does manage to corrupt their manifest. Can we
make use of json_errdetail() and print the line and column numbers?
Patch 0001 over at [1] has one approach to making json_errdetail()
workable in frontend code.



Looks sound on a first look. Maybe we should get that pushed ASAP so we 
can take advantage of it.






Top-level scalars like `false` or `12345` do not parse correctly if
the chunk size is too small; instead json_errdetail() reports 'Token
"" is invalid'. With small chunk sizes, json_errdetail() additionally
segfaults on constructions like `[tru]` or `12zz`.



Ugh. Will investigate.




For my local testing, I'm carrying the following diff in
001_test_json_parser_incremental.pl:


- ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds");
- ok(!$stderr, "chunk size $size: no error output");
+ like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds");
+ is($stderr, "", "chunk size $size: no error output");

This is particularly helpful when a test fails spuriously due to code
coverage spray on stderr.



Makes sense, thanks.


I'll have a fresh patch set soon which will also take care of the bitrot.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-03-07 Thread Jacob Champion
Some more observations as I make my way through the patch:

In src/common/jsonapi.c,

> +#define JSON_NUM_NONTERMINALS 6

Should this be 5 now?

> +   res = pg_parse_json_incremental(&(incstate->lex), &(incstate->sem),
> +   chunk, size, is_last);
> +
> +   expected = is_last ? JSON_SUCCESS : JSON_INCOMPLETE;
> +
> +   if (res != expected)
> +   json_manifest_parse_failure(context, "parsing failed");

This leads to error messages like

pg_verifybackup: error: could not parse backup manifest: parsing failed

which I would imagine is going to lead to confused support requests in
the event that someone does manage to corrupt their manifest. Can we
make use of json_errdetail() and print the line and column numbers?
Patch 0001 over at [1] has one approach to making json_errdetail()
workable in frontend code.
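
Assuming that patch lands, the call site could become roughly (a sketch):

    if (res != expected)
        json_manifest_parse_failure(context,
                                    json_errdetail(res, &(incstate->lex)));

which would at least name the offending token instead of giving a bare
"parsing failed".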

Top-level scalars like `false` or `12345` do not parse correctly if
the chunk size is too small; instead json_errdetail() reports 'Token
"" is invalid'. With small chunk sizes, json_errdetail() additionally
segfaults on constructions like `[tru]` or `12zz`.

For my local testing, I'm carrying the following diff in
001_test_json_parser_incremental.pl:

> - ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds");
> - ok(!$stderr, "chunk size $size: no error output");
> + like($stdout, qr/SUCCESS/, "chunk size $size: test succeeds");
> + is($stderr, "", "chunk size $size: no error output");

This is particularly helpful when a test fails spuriously due to code
coverage spray on stderr.

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/CAOYmi%2BmSSY4SvOtVN7zLyUCQ4-RDkxkzmTuPEN%2Bt-PsB7GHnZA%40mail.gmail.com




Re: WIP Incremental JSON Parser

2024-02-27 Thread Jacob Champion
On Mon, Feb 26, 2024 at 9:20 PM Andrew Dunstan  wrote:
> The good news is that the parser is doing fine - this issue was due to a
> thinko on my part in the test program that got triggered by the input
> file size being an exact multiple of the chunk size. I'll have a new
> patch set later this week.

Ah, feof()! :D Good to know it's not the partial token logic.

--Jacob




Re: WIP Incremental JSON Parser

2024-02-26 Thread Andrew Dunstan



On 2024-02-26 Mo 19:20, Andrew Dunstan wrote:


On 2024-02-26 Mo 10:10, Jacob Champion wrote:

On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion
 wrote:

As a brute force example of the latter, with the attached diff I get
test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

But this time with the diff.



Ouch. I'll check it out. Thanks!





The good news is that the parser is doing fine - this issue was due to a 
thinko on my part in the test program that got triggered by the input 
file size being an exact multiple of the chunk size. I'll have a new 
patch set later this week.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-02-26 Thread Andrew Dunstan



On 2024-02-26 Mo 10:10, Jacob Champion wrote:

On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion
 wrote:

As a brute force example of the latter, with the attached diff I get
test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

But this time with the diff.



Ouch. I'll check it out. Thanks!


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-02-26 Thread Jacob Champion
On Mon, Feb 26, 2024 at 7:08 AM Jacob Champion
 wrote:
> As a brute force example of the latter, with the attached diff I get
> test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

But this time with the diff.

--Jacob
diff --git 
a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl 
b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
index 8eeb7f5b91..08c280037f 100644
--- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
+++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
@@ -10,11 +10,13 @@ my $test_file = "$FindBin::RealBin/../tiny.json";
 
 my $exe = "test_json_parser_incremental";
 
-my ($stdout, $stderr) = run_command( [$exe, $test_file] );
-
-ok($stdout =~ /SUCCESS/, "test succeeds");
-ok(!$stderr, "no error output");
+for (my $size = 64; $size > 0; $size--)
+{
+   my ($stdout, $stderr) = run_command( [$exe, "-c", $size, $test_file] );
 
+   ok($stdout =~ /SUCCESS/, "chunk size $size: test succeeds");
+   ok(!$stderr, "chunk size $size: no error output");
+}
 
 done_testing();
 
diff --git a/src/test/modules/test_json_parser/test_json_parser_incremental.c 
b/src/test/modules/test_json_parser/test_json_parser_incremental.c
index dee5c6f7d1..a94cc8adb8 100644
--- a/src/test/modules/test_json_parser/test_json_parser_incremental.c
+++ b/src/test/modules/test_json_parser/test_json_parser_incremental.c
@@ -12,6 +12,7 @@
  * the parser in very small chunks. In practice you would normally use
  * much larger chunks, but doing this makes it more likely that the
  * full range of increment handling, especially in the lexer, is exercised.
+ * If the "-c SIZE" option is provided, that chunk size is used instead.
  *
  * The argument specifies the file containing the JSON input.
  *
@@ -34,12 +35,19 @@ main(int argc, char **argv)
JsonLexContext lex;
StringInfoData json;
int n_read;
+   size_t  chunk_size = 60;
+
+   if (strcmp(argv[1], "-c") == 0)
+   {
+   sscanf(argv[2], "%zu", &chunk_size);
+   argv += 2;
+   }
 
makeJsonLexContextIncremental(, PG_UTF8, false);
initStringInfo();
 
json_file = fopen(argv[1], "r");
-   while ((n_read = fread(buff, 1, 60, json_file)) > 0)
+   while ((n_read = fread(buff, 1, chunk_size, json_file)) > 0)
{
	appendBinaryStringInfo(&json, buff, n_read);
	appendStringInfoString(&json, "1+23 trailing junk");


Re: WIP Incremental JSON Parser

2024-02-26 Thread Jacob Champion
On Thu, Feb 22, 2024 at 3:43 PM Andrew Dunstan  wrote:
> > Are there plans to fill out the test suite more? Since we should be
> > able to control all the initial conditions, it'd be good to get fairly
> > comprehensive coverage of the new code.
>
> Well, it's tested (as we know) by the backup manifest tests. During
> development, I tested by making the regular parser use the
> non-recursive  parser (see FORCE_JSON_PSTACK). That doesn't test the
> incremental piece of it, but it does check that the rest of it is doing
> the right thing. We could probably extend the incremental test by making
> it output a stream of tokens and making sure that was correct.

That would also cover all the semantic callbacks (currently,
OFIELD_END and AELEM_* are uncovered), so +1 from me.

Looking at lcov, it'd be good to
- test failure modes as well as the happy path, so we know we're
rejecting invalid syntax correctly
- test the prediction stack resizing code
- provide targeted coverage of the partial token support, since at the
moment we're at the mercy of the manifest format and the default chunk
size

As a brute force example of the latter, with the attached diff I get
test failures at chunk sizes 1, 2, 3, 4, 6, and 12.

> > As an aside, I find the behavior of need_escapes=false to be
> > completely counterintuitive. I know the previous code did this, but it
> > seems like the opposite of "provides unescaped strings" should be
> > "provides raw strings", not "all strings are now NULL".
>
> Yes, we could possibly call it "need_strings"  or something like that.

+1

--Jacob




Re: WIP Incremental JSON Parser

2024-02-22 Thread Andrew Dunstan



On 2024-02-22 Th 15:29, Jacob Champion wrote:

On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan  wrote:

Patch 5 in this series fixes those issues and
adjusts most of the tests to add some trailing junk to the pieces of
json, so we can be sure that this is done right.

This fixes the test failure for me, thanks! I've attached my current
mesonification diff, which just adds test_json_parser to the suite. It
relies on the PATH that's set up, which appears to include the build
directory for both VPATH builds and Meson.




OK, thanks, will add this in the next version.




Are there plans to fill out the test suite more? Since we should be
able to control all the initial conditions, it'd be good to get fairly
comprehensive coverage of the new code.




Well, it's tested (as we know) by the backup manifest tests. During 
development, I tested by making the regular parser use the 
non-recursive  parser (see FORCE_JSON_PSTACK). That doesn't test the 
incremental piece of it, but it does check that the rest of it is doing 
the right thing. We could probably extend the incremental test by making 
it output a stream of tokens and making sure that was correct.
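
Something like this for the test program's callbacks, say (a sketch,
assuming the JsonParseErrorType-returning signatures discussed elsewhere
in this thread):

    static JsonParseErrorType
    do_scalar(void *state, char *token, JsonTokenType tokentype)
    {
        printf("scalar(%d): %s\n", (int) tokentype, token);
        return JSON_SUCCESS;
    }

    static JsonParseErrorType
    do_object_field_start(void *state, char *fname, bool isnull)
    {
        printf("ofield_start: %s\n", fname);
        return JSON_SUCCESS;
    }

Running the same file at several chunk sizes and diffing the resulting
token streams would then cover the semantic callbacks too.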




As an aside, I find the behavior of need_escapes=false to be
completely counterintuitive. I know the previous code did this, but it
seems like the opposite of "provides unescaped strings" should be
"provides raw strings", not "all strings are now NULL".



Yes, we could possibly call it "need_strings"  or something like that.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-02-22 Thread Jacob Champion
On Thu, Feb 22, 2024 at 1:38 AM Andrew Dunstan  wrote:
> Patch 5 in this series fixes those issues and
> adjusts most of the tests to add some trailing junk to the pieces of
> json, so we can be sure that this is done right.

This fixes the test failure for me, thanks! I've attached my current
mesonification diff, which just adds test_json_parser to the suite. It
relies on the PATH that's set up, which appears to include the build
directory for both VPATH builds and Meson.

Are there plans to fill out the test suite more? Since we should be
able to control all the initial conditions, it'd be good to get fairly
comprehensive coverage of the new code.

As an aside, I find the behavior of need_escapes=false to be
completely counterintuitive. I know the previous code did this, but it
seems like the opposite of "provides unescaped strings" should be
"provides raw strings", not "all strings are now NULL".

--Jacob
commit 590ea7bec167058340624313d98c72976fa89d7a
Author: Jacob Champion 
Date:   Wed Feb 21 06:36:55 2024 -0800

WIP: mesonify

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..e5c9bd10cf 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_dsm_registry')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
 subdir('test_integerset')
+subdir('test_json_parser')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_json_parser/meson.build 
b/src/test/modules/test_json_parser/meson.build
new file mode 100644
index 00..a5bedce94e
--- /dev/null
+++ b/src/test/modules/test_json_parser/meson.build
@@ -0,0 +1,50 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_json_parser_incremental_sources = files(
+  'test_json_parser_incremental.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_incremental_sources += rc_bin_gen.process(win32ver_rc, 
extra_args: [
+'--NAME', 'test_json_parser_incremental',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_incremental = executable('test_json_parser_incremental',
+  test_json_parser_incremental_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+test_json_parser_perf_sources = files(
+  'test_json_parser_perf.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_perf_sources += rc_bin_gen.process(win32ver_rc, extra_args: 
[
+'--NAME', 'test_json_parser_perf',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_perf = executable('test_json_parser_perf',
+  test_json_parser_perf_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+tests += {
+  'name': 'test_json_parser',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+'tests': [
+  't/001_test_json_parser_incremental.pl',
+],
+  },
+}
diff --git 
a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl 
b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
index fc9718baf3..8eeb7f5b91 100644
--- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
+++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
@@ -8,7 +8,7 @@ use FindBin;
 
 my $test_file = "$FindBin::RealBin/../tiny.json";
 
-my $exe = "$ENV{TESTDATADIR}/../test_json_parser_incremental";
+my $exe = "test_json_parser_incremental";
 
 my ($stdout, $stderr) = run_command( [$exe, $test_file] );
 


Re: WIP Incremental JSON Parser

2024-02-21 Thread Jacob Champion
On Wed, Feb 21, 2024 at 6:50 AM Jacob Champion
 wrote:
> On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan  wrote:
> > *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> > you give me details of the build? OS, compiler, path to source, build
> > setup etc.? Anything that might be remotely relevant.

This construction seems suspect, in json_lex_number():

>   if (lex->incremental && !lex->inc_state->is_last_chunk &&
>   len >= lex->input_length)
>   {
>   appendStringInfoString(&lex->inc_state->partial_token,
>  lex->token_start);
>   return JSON_INCOMPLETE;
>   }

appendStringInfoString() isn't respecting the end of the chunk: if
there's extra data after the chunk boundary (as
AppendIncrementalManifestData() does) then all of that will be stuck
onto the end of the partial_token.
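
One way out might be the length-bounded variant (a sketch):

    /* append only up to the end of the current chunk, not to the NUL */
    appendBinaryStringInfo(&lex->inc_state->partial_token,
                           lex->token_start,
                           lex->input + lex->input_length - lex->token_start);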

I'm about to context-switch off of this for the day, but I can work on
a patch tomorrow if that'd be helpful. It looks like this is not the
only call to appendStringInfoString().

--Jacob




Re: WIP Incremental JSON Parser

2024-02-21 Thread Jacob Champion
On Tue, Feb 20, 2024 at 9:32 PM Andrew Dunstan  wrote:
> *sigh* That's weird. I wonder why you can reproduce it and I can't. Can
> you give me details of the build? OS, compiler, path to source, build
> setup etc.? Anything that might be remotely relevant.

Sure:
- guest VM running in UTM (QEMU 7.2) is Ubuntu 22.04 for ARM, default
core count, 8GB
- host is macOS Sonoma 14.3.1, Apple Silicon (M3 Pro), 36GB
- it's a Meson build (plus a diff for the test utilities, attached,
but that's hopefully not relevant to the failure)
- buildtype=debug, optimization=g, cassert=true
- GCC 11.4
- build path is nested a bit (~/src/postgres/worktree-inc-json/build-dev)

--Jacob
commit 0075a88beec160cbb408d9a1e0a11d836fb55bdf
Author: Jacob Champion 
Date:   Wed Feb 21 06:36:55 2024 -0800

WIP: mesonify

diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..e5c9bd10cf 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_dsm_registry')
 subdir('test_extensions')
 subdir('test_ginpostinglist')
 subdir('test_integerset')
+subdir('test_json_parser')
 subdir('test_lfind')
 subdir('test_misc')
 subdir('test_oat_hooks')
diff --git a/src/test/modules/test_json_parser/meson.build 
b/src/test/modules/test_json_parser/meson.build
new file mode 100644
index 00..42eb670864
--- /dev/null
+++ b/src/test/modules/test_json_parser/meson.build
@@ -0,0 +1,39 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+test_json_parser_incremental_sources = files(
+  'test_json_parser_incremental.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_incremental_sources += rc_bin_gen.process(win32ver_rc, 
extra_args: [
+'--NAME', 'test_json_parser_incremental',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_incremental = executable('test_json_parser_incremental',
+  test_json_parser_incremental_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)
+
+test_json_parser_perf_sources = files(
+  'test_json_parser_perf.c',
+)
+
+if host_system == 'windows'
+  test_json_parser_perf_sources += rc_bin_gen.process(win32ver_rc, extra_args: 
[
+'--NAME', 'test_json_parser_perf',
+'--FILEDESC', 'standalone json parser tester',
+  ])
+endif
+
+test_json_parser_perf = executable('test_json_parser_perf',
+  test_json_parser_perf_sources,
+  dependencies: [frontend_code],
+  kwargs: default_bin_args + {
+'install': false,
+  },
+)


Re: WIP Incremental JSON Parser

2024-02-20 Thread Andrew Dunstan



On 2024-02-20 Tu 19:53, Jacob Champion wrote:

On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan  wrote:

Well, that didn't help a lot, but meanwhile the CFBot seems to have
decided in the last few days that it's now happy, so full steam aead! ;-)

I haven't been able to track down the root cause yet, but I am able to
reproduce the failure consistently on my machine:

 ERROR:  could not parse backup manifest: file size is not an
integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT",
"Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" },
 { "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20

Full log is attached.




*sigh* That's weird. I wonder why you can reproduce it and I can't. Can 
you give me details of the build? OS, compiler, path to source, build 
setup etc.? Anything that might be remotely relevant.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-02-20 Thread Jacob Champion
On Tue, Feb 20, 2024 at 2:10 PM Andrew Dunstan  wrote:
> Well, that didn't help a lot, but meanwhile the CFBot seems to have
> decided in the last few days that it's now happy, so full steam aead! ;-)

I haven't been able to track down the root cause yet, but I am able to
reproduce the failure consistently on my machine:

ERROR:  could not parse backup manifest: file size is not an
integer: '16384, "Last-Modified": "2024-02-20 23:07:43 GMT",
"Checksum-Algorithm": "CRC32C", "Checksum": "66c829cd" },
{ "Path": "base/4/2606", "Size": 24576, "Last-Modified": "20

Full log is attached.

--Jacob


regress_log_003_timeline.gz
Description: GNU Zip compressed data


Re: WIP Incremental JSON Parser

2024-02-19 Thread Andrew Dunstan



On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:


On 2024-01-24 We 13:08, Robert Haas wrote:


Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.



Yeah, I thought earlier today I was on the track of an off by one 
error, but I was apparently mistaken, so here's the same patch set 
with an extra patch that logs a bunch of stuff, and might let us see 
what's upsetting the cfbot.






Well, that didn't help a lot, but meanwhile the CFBot seems to have 
decided in the last few days that it's now happy, so full steam aead! ;-)



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-01-24 Thread Robert Haas
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan  wrote:
> The cfbot reports an error on a 32 bit build 
> :
>
> # Running: pg_basebackup -D 
> /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2
>  --no-sync -cfast --incremental 
> /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
> pg_basebackup: error: could not upload manifest: ERROR:  could not parse 
> backup manifest: file size is not an integer
> pg_basebackup: removing data directory 
> "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
> [02:41:07.830](0.073s) not ok 2 - incremental backup from node1
> [02:41:07.830](0.000s) #   Failed test 'incremental backup from node1'
>
> I have set up a Debian 12 EC2 instance following the recipe at 
> ,
>  and ran what I think are the same tests dozens of times, but the failure did 
> not reappear in my setup. Unfortunately, the test doesn't show the failing 
> manifest or log the failing field, so trying to divine what happened here is 
> more than difficult.
>
> Not sure how to address this.

Yeah, that's really odd. The backup size field is printed like this:

appendStringInfo(&buf, "\"Size\": %zu, ", size);

And parsed like this:

size = strtoul(parse->size, &ep, 10);
if (*ep)
    json_manifest_parse_failure(parse->context,
                                "file size is not an integer");

I confess to bafflement -- how could the output of the first fail to
be parsed by the second? The manifest has to look pretty much valid in
order not to error out before it gets to this check, with just that
one field corrupted. But I don't understand how that could happen.

I agree that the error reporting could be better here, but it didn't
seem worth spending time on when I wrote the code. I figured the only
way we could end up with something like "Size": "Walrus" is if the
user was messing with us on purpose. Apparently that's not so, yet the
mechanism eludes me. Or maybe it's not some random string, but is
something like an empty string or a number with trailing garbage or a
number that's out of range. But I don't see how any of those things
can happen either.
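
If we wanted to harden the parse side anyway, something like this would
catch all three of those hypotheticals (a sketch; needs <errno.h>):

    errno = 0;
    size = strtoul(parse->size, &ep, 10);
    if (ep == parse->size || *ep != '\0' || errno == ERANGE)
        json_manifest_parse_failure(parse->context,
                                    "file size is not an integer");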

Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-24 Thread Andrew Dunstan


On 2024-01-22 Mo 21:02, Andrew Dunstan wrote:


On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:


On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:


On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.



Thanks.

Let's see if the attached does better.




This time for sure! (Watch me pull a rabbit out of my hat!)


It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed 
it, and I think the test will now pass.






Fixed one problem but there are some others. I'm hoping this will 
satisfy the cfbot.






The cfbot reports an error on a 32 bit build 
:


# Running: pg_basebackup -D 
/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2
 --no-sync -cfast --incremental 
/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR:  could not parse backup 
manifest: file size is not an integer
pg_basebackup: removing data directory 
"/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) #   Failed test 'incremental backup from node1'

I have set up a Debian 12 EC2 instance following the recipe at 
, 
and ran what I think are the same tests dozens of times, but the failure 
did not reappear in my setup. Unfortunately, the test doesn't show the 
failing manifest or log the failing field, so trying to divine what 
happened here is more than difficult.


Not sure how to address this.


cheers


andrew


--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: WIP Incremental JSON Parser

2024-01-21 Thread Peter Smith
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4725/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725

Kind Regards,
Peter Smith.




Re: WIP Incremental JSON Parser

2024-01-10 Thread Andrew Dunstan



On 2024-01-09 Tu 13:46, Jacob Champion wrote:

On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan  wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.



Thanks for looking! I've been playing around with a similar idea, but 
yours might be better.






Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?



Also a good suggestion. Will look and see. IIRC I had trouble with this bit.





+   case JSON_SEM_SCALAR_CALL:
+   {
+   json_scalar_action sfunc = sem->scalar;
+
+   if (sfunc != NULL)
+   (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?



Good point. In fact it might be responsible for the error I'm currently 
trying to get to the bottom of.



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-01-09 Thread Jacob Champion
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan  wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse than that of the current Recursive
> Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.

Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?

> +   case JSON_SEM_SCALAR_CALL:
> +   {
> +   json_scalar_action sfunc = sem->scalar;
> +
> +   if (sfunc != NULL)
> +   (*sfunc) (sem->semstate, scalar_val, scalar_tok);
> +   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?

> One possible use would be in parsing large manifest files for
> incremental backup.

I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.

--Jacob
commit 79d0dc78b9f3b0bbc078876417b8f46970308e6e
Author: Jacob Champion 
Date:   Thu Jan 4 11:46:06 2024 -0800

WIP: try to speed up prediction

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index d875d57df7..9149d45a4b 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -165,39 +165,47 @@ static char JSON_PROD_MORE_KEY_PAIRS[] = 
{JSON_NT_MORE_KEY_PAIRS, JSON_NT_KEYPAI
  * Any combination not specified here represents an error.
  */
 
-static char *td_parser_table[JSON_NUM_NONTERMINALS][JSON_NUM_TERMINALS] =
+struct td_entry
+{
+   size_t  len;
+   char   *prod;
+};
+
+#define TD_ENTRY(PROD) { sizeof(PROD) - 1, (PROD) }
+
+static struct td_entry 
td_parser_table[JSON_NUM_NONTERMINALS][JSON_NUM_TERMINALS] =
 {
/* JSON */
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_STRING] = JSON_PROD_SCALAR_STRING,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_NUMBER] = JSON_PROD_SCALAR_NUMBER,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_TRUE] = JSON_PROD_SCALAR_TRUE,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_FALSE] = JSON_PROD_SCALAR_FALSE,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_NULL] = JSON_PROD_SCALAR_NULL,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_ARRAY_START] = JSON_PROD_ARRAY,
-   [OFS(JSON_NT_JSON)][JSON_TOKEN_OBJECT_START] = JSON_PROD_OBJECT,
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_STRING] = 
TD_ENTRY(JSON_PROD_SCALAR_STRING),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_NUMBER] = 
TD_ENTRY(JSON_PROD_SCALAR_NUMBER),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_TRUE] = TD_ENTRY(JSON_PROD_SCALAR_TRUE),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_FALSE] = 
TD_ENTRY(JSON_PROD_SCALAR_FALSE),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_NULL] = TD_ENTRY(JSON_PROD_SCALAR_NULL),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_ARRAY_START] = TD_ENTRY(JSON_PROD_ARRAY),
+   [OFS(JSON_NT_JSON)][JSON_TOKEN_OBJECT_START] = 
TD_ENTRY(JSON_PROD_OBJECT),
/* ARRAY_ELEMENTS */
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_ARRAY_START] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_OBJECT_START] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_STRING] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_NUMBER] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_TRUE] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_FALSE] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_NULL] = 
JSON_PROD_ARRAY_ELEMENTS,
-   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_ARRAY_END] = JSON_PROD_EPSILON,
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_ARRAY_START] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_OBJECT_START] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_STRING] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_NUMBER] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_TRUE] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_FALSE] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_NULL] = 
TD_ENTRY(JSON_PROD_ARRAY_ELEMENTS),
+   [OFS(JSON_NT_ARRAY_ELEMENTS)][JSON_TOKEN_ARRAY_END] = 
TD_ENTRY(JSON_PROD_EPSILON),
/* MORE_ARRAY_ELEMENTS */
-   

Re: WIP Incremental JSON Parser

2024-01-04 Thread Robert Haas
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams  wrote:
> On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
> > It seems like a pretty significant savings no matter what. Suppose the
> > backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> > create an 1MB buffer and feed the data to the parser in 1MB chunks.
> > Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> > you raise here in some way, we can potentially save even more memory,
> > which is great, but even if we don't, we still saved a bunch of memory
> > that could not have been saved in any other way.
>
> You could also build a streaming incremental parser.  That is, one that
> outputs a path and a leaf value (where leaf values are scalar values,
> `null`, `true`, `false`, numbers, and strings).  Then if the caller is
> doing something JSONPath-like then the caller can probably immediately
> free almost all allocations and even terminate the parse early.

I think our current parser is event-based rather than this ... but it
seems like this could easily be built on top of it, if someone wanted
to.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Nico Williams
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create a 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.

You could also build a streaming incremental parser.  That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings).  Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.
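
For concreteness, one possible shape for that interface (a sketch only;
none of these names exist in the tree):

#include <stdbool.h>

typedef enum
{
	LEAF_NULL, LEAF_FALSE, LEAF_TRUE, LEAF_NUMBER, LEAF_STRING
} leaf_kind;

/*
 * Called once per scalar leaf with its JSONPath-like location, e.g.
 * "$.servers[3].host".  Returning false aborts the parse early.
 */
typedef bool (*leaf_callback) (void *state,
							   const char *path,
							   leaf_kind kind,
							   const char *token);

A caller doing JSONPath-like matching would compare "path" against its
pattern inside the callback, copy only the values it wants, and return
false as soon as it has them.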

Nico
-- 




Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-03 We 10:12, Robert Haas wrote:

On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan  wrote:

Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.

I guess we can benchmark it and see what the performance impact of using
a hash table might be.

Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().

Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.

I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.




OK, cleaning up in the client code will be much simpler, so let's go 
with that for now and revisit it later if necessary.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-01-03 Thread Robert Haas
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan  wrote:
> Say we have a document with an array of 1m objects, each with a field
> called "color". As it stands we'll allocate space for that field name 1m
> times. Using a hash table we'd allocate space for it once. And
> allocating the memory isn't free, although it might be cheaper than
> doing hash lookups.
>
> I guess we can benchmark it and see what the performance impact of using
> a hash table might be.
>
> Another possibility would be simply to have the callback free the field
> name after use. For the parse_manifest code that could be a one-line
> addition to the code at the bottom of json_object_manifest_field_start().

Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.

I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-03 We 08:45, Robert Haas wrote:

On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan  wrote:

Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.



I don't quite follow.

Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than 
doing hash lookups.


I guess we can benchmark it and see what the performance impact of using 
a hash table might be.


Another possibility would be simply to have the callback free the field 
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
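
Roughly like this, sketched from memory (the existing body of the
function is elided, and the exact placement is illustrative):

static void
json_object_manifest_field_start(void *state, char *fname, bool isnull)
{
	/* ... existing logic that inspects fname ... */

	pfree(fname);				/* the proposed one-line addition */
}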




The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?



No, I haven't. I have simply theorized about how much memory we might
consume if nothing were done by the callers to free the memory.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-01-03 Thread Robert Haas
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan  wrote:
> Yeah. One idea I had yesterday was to stash the field names, which in
> large JSON docs tend to be pretty repetitive, in a hash table instead of
> pstrduping each instance. The name would be valid until the end of the
> parse, and would only need to be duplicated by the callback function if
> it were needed beyond that. That's not the case currently with the
> parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.

> The parse_manifest code does seem to pfree the scalar values it no
> longer needs fairly well, so maybe we don't need to do anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-02 Tu 10:14, Robert Haas wrote:

On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan  wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.


One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.

As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.



Yeah. One idea I had yesterday was to stash the field names, which in 
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the 
parse, and would only need to be duplicated by the callback function if 
it were needed beyond that. That's not the case currently with the 
parse_manifest code. I'll work on using a hash table.
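
Something along these lines is what I have in mind, at least for backend
use (a dynahash sketch with invented names; frontend code would need a
different hash table, and names longer than NAMEDATALEN would need
handling):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct JsonFieldNameEntry
{
	char		name[NAMEDATALEN];	/* hash key */
} JsonFieldNameEntry;

static HTAB *json_field_names = NULL;

/* Return a copy of fname that stays valid until the table is destroyed. */
static const char *
json_intern_field_name(const char *fname)
{
	JsonFieldNameEntry *entry;
	bool		found;

	if (json_field_names == NULL)
	{
		HASHCTL		ctl;

		ctl.keysize = NAMEDATALEN;
		ctl.entrysize = sizeof(JsonFieldNameEntry);
		json_field_names = hash_create("json field names", 256, &ctl,
									   HASH_ELEM | HASH_STRINGS);
	}

	/* allocates space for the name only on first sight */
	entry = hash_search(json_field_names, fname, HASH_ENTER, &found);
	return entry->name;
}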


The parse_manifest code does seem to pfree the scalar values it no 
longer needs fairly well, so maybe we don't need to do anything there.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: WIP Incremental JSON Parser

2024-01-02 Thread Robert Haas
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan  wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse than that of the current Recursive
> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
> and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.

> One possible use would be in parsing large manifest files for
> incremental backup. However, it struck me a few days ago that this might
> not work all that well. The current parser and the new parser both
> palloc() space for each field name and scalar token in the JSON (unless
> they aren't used, which is normally not the case), and they don't free
> it, so that, particularly if done in frontend code, this amounts to a
> possible memory leak, unless the semantic routines do the freeing
> themselves. So while we can save some memory by not having to slurp in
> the whole JSON in one hit, we aren't saving any of that other allocation
> of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
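
Concretely, the consumer loop would look something like this (a sketch;
the incremental entry point's name and exact signature are whatever the
WIP patch ends up with, and this assumes common/jsonapi.h plus the usual
palloc()/pfree() helpers):

#define CHUNK_SIZE	(1024 * 1024)	/* 1MB instead of a 2GB slurp */

static bool
parse_manifest_in_chunks(FILE *fp, JsonLexContext *lex, JsonSemAction *sem)
{
	char	   *buf = palloc(CHUNK_SIZE);
	size_t		nread;

	while ((nread = fread(buf, 1, CHUNK_SIZE, fp)) > 0)
	{
		/* feof() tells the parser whether this was the final chunk */
		if (pg_parse_json_incremental(lex, sem, buf, nread,
									  feof(fp) != 0) != JSON_SUCCESS)
		{
			pfree(buf);
			return false;
		}
	}
	pfree(buf);
	return true;
}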

As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.
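
For instance, option (1) might be documented about as tersely as this
(wording invented here, not proposed text):

/*
 * "fname" is a palloc'd, NUL-terminated copy allocated in the memory
 * context that was current when the parser was invoked.  The callback
 * may pfree() it; the parser never touches it again after the
 * callback returns.
 */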

-- 
Robert Haas
EDB: http://www.enterprisedb.com