Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread David E. Wheeler
On Apr 24, 2024, at 3:22 PM, Erik Wienhold  wrote:

> Thanks Peter!  But what is the definition of the entire path expression?
> Perhaps something like:
> 
> ::=  { "."  }
> 
> That would imply that "$.$foo" is a valid path that accesses a variable
> member (but I guess the path evaluation is also specified somewhere).

I read it as “if it starts with a dollar sign, it’s a variable and not a path 
identifier”, and I assume any `.foo` expression is a path identifier.

> What bugs me about this description, after reading it a couple of times,
> is that it's not clear what is meant by ."$varname".  It could mean two
> things: (1) the double-quoting masks $varname in order to not interpret
> those characters as a variable or (2) an interpolated string that
> resolves $varname and yields a dynamic member accessor.

My understanding is that if it’s in double quotes it’s never anything other 
than a string (whether a string literal or a path identifier string literal). 
IOW, variables don’t interpolate inside strings.

> Under case (2) I'd expect that query to return 456 (because $foo
> resolves to "bar").  (Similar to how psql would resolve :'foo' to
> 'bar'.)

Yes, I suspect this is the correct interpretation, but agree the wording could 
use some massaging, especially since path identifiers cannot start with a 
dollar sign anyway. Perhaps:

"If the key name starts with $ or does not meet the JavaScript rules for an 
identifier, it must be enclosed in double quotes to make it a string literal."

> Variables already work in array accessors and table 8.25 says that "The
> specified index can be an integer, as well as an expression returning a
> single numeric value [...]".  A variable is such an expression.
> 
>=> select jsonb_path_query('[2,3,5]', '$[$i]', '{"i":1}');
> jsonb_path_query
>--
> 3
>(1 row)
> 
> So I'd expect a similar behavior for member accessors as well when
> seeing ."$varname" in the same table.

Oh, interesting point! Now I wonder if the standard has this inconsistency (and 
is aware of it).

> Yes, I think so.  That would be case C in the spec excerpt provided by
> Peter.  So it's just a key name that happens to contain (but not start
> with) the dollar sign.

Exactly. It also matches the doc you quote above. Something would have to 
change in src/backend/utils/adt/jsonpath_scan.l to fix that, but that file 
makes my eyes water, so I’m not gonna take a stab at it. :-)

D





Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread Erik Wienhold
On 2024-04-24 13:52 +0200, David E. Wheeler wrote:
> On Apr 24, 2024, at 05:51, Peter Eisentraut  wrote:
> 
> >A  is classified as follows.
> > 
> >Case:
> > 
> >a) A  that is a  is a  >   path context variable>.
> > 
> >b) A  that begins with  is a
> >   .
> > 
> >c) Otherwise, a  is a .
> > 
> > Does this help?  I wasn't following all the discussion to see if
> > there is anything wrong with the implementation.

Thanks Peter!  But what is the definition of the entire path expression?
Perhaps something like:

 ::=  { "."  }

That would imply that "$.$foo" is a valid path that accesses a variable
member (but I guess the path evaluation is also specified somewhere).

Does it say anything about double-quoted accessors?  In table 8.25[1] we
allow member accessor ."$varname" and it says "If the key name matches
some named variable starting with $ or does not meet the JavaScript
rules for an identifier, it must be enclosed in double quotes to make it
a string literal."

What bugs me about this description, after reading it a couple of times,
is that it's not clear what is meant by ."$varname".  It could mean two
things: (1) the double-quoting masks $varname in order to not interpret
those characters as a variable or (2) an interpolated string that
resolves $varname and yields a dynamic member accessor.

The current implementation supports (1), i.e., ."$foo" does not refer to
variable foo but the actual property "$foo":

=> select jsonb_path_query('{"$foo":123,"bar":456}', '$."$foo"', 
'{"foo":"bar"}');
 jsonb_path_query
--
 123
(1 row)

Under case (2) I'd expect that query to return 456 (because $foo
resolves to "bar").  (Similar to how psql would resolve :'foo' to
'bar'.)

Variables already work in array accessors and table 8.25 says that "The
specified index can be an integer, as well as an expression returning a
single numeric value [...]".  A variable is such an expression.

=> select jsonb_path_query('[2,3,5]', '$[$i]', '{"i":1}');
 jsonb_path_query
--
 3
(1 row)

So I'd expect a similar behavior for member accessors as well when
seeing ."$varname" in the same table.

> Yes, it does, as it ties the special meaning of the dollar sign to the
> *beginning* of an expression. So it makes sense that this would be an
> error:
> 
> david=# select '$.$foo'::jsonpath;
> ERROR: syntax error at or near "$foo" of jsonpath input
> LINE 1: select '$.$foo'::jsonpath;
>^
> But I’m less sure when a dollar sign is used in the *middle* (or end)
> of a json path identifier:
> 
> david=# select '$.xx$foo'::jsonpath;
> ERROR:  syntax error at or near "$foo" of jsonpath input
> LINE 1: select '$.xx$foo'::jsonpath;
>^
> Perhaps that should be valid?

Yes, I think so.  That would be case C in the spec excerpt provided by
Peter.  So it's just a key name that happens to contain (but not start
with) the dollar sign.

[1] 
https://www.postgresql.org/docs/current/datatype-json.html#TYPE-JSONPATH-ACCESSORS

-- 
Erik




Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread David E. Wheeler
On Apr 24, 2024, at 05:46, Peter Eisentraut  wrote:

> I have committed this patch, and backpatched it, as a bug fix, because the 
> existing description was wrong.  To keep the patch minimal for backpatching, 
> I didn't do the conversion to a list.  I'm not sure I like that anyway, 
> because it tends to draw more attention to that part over the surrounding 
> parts, which didn't seem appropriate in this case.  But anyway, if you have 
> any more non-bug-fix editing in this area, which would then target PG18, 
> please send more patches.

Makes sense, that level of detail gets into the weeks so maybe doesn’t need to 
be quite so prominent as a list. Thank you!

David





Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread David E. Wheeler
On Apr 24, 2024, at 05:51, Peter Eisentraut  wrote:

>A  is classified as follows.
> 
>Case:
> 
>a) A  that is a  is acontext variable>.
> 
>b) A  that begins with  is a
>   .
> 
>c) Otherwise, a  is a .
> 
> Does this help?  I wasn't following all the discussion to see if there is 
> anything wrong with the implementation.

Yes, it does, as it ties the special meaning of the dollar sign to the 
*beginning* of an expression. So it makes sense that this would be an error:

david=# select '$.$foo'::jsonpath;
ERROR: syntax error at or near "$foo" of jsonpath input
LINE 1: select '$.$foo'::jsonpath;
   ^

But I’m less sure when a dollar sign is used in the *middle* (or end) of a json 
path identifier:

david=# select '$.xx$foo'::jsonpath;
ERROR:  syntax error at or near "$foo" of jsonpath input
LINE 1: select '$.xx$foo'::jsonpath;
   ^

Perhaps that should be valid?

Best,

David





Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread Peter Eisentraut

On 18.03.24 01:09, Erik Wienhold wrote:

The error message 'syntax error at or near "$oo" of jsonpath input' for
the second case ($.f$oo), however, looks as if the scanner identifies
'$oo' as a variable instead of contiuing the scan of identifier (f$oo)
for the member accessor.  Looks like a bug to me because a variable
doesn't even make sense in that place.

Right. Maybe the docs should be updated to say that a literal dollar
sign isn’t supported in identifiers, unlike in JavaScript, except
through escapes like this:

Unfortunately, I don't have access to that part of the SQL spec.  So I
don't know how the jsonpath grammar is specified.


The SQL spec says that  corresponds to Identifier 
in ECMAScript.


But it also says,

A  is classified as follows.

Case:

a) A  that is a  is a .

b) A  that begins with  is a
   .

c) Otherwise, a  is a .

Does this help?  I wasn't following all the discussion to see if there 
is anything wrong with the implementation.






Re: Q: Escapes in jsonpath Idents

2024-04-24 Thread Peter Eisentraut

On 17.03.24 20:12, Erik Wienhold wrote:

Mentioning JSON and \v in the same sentence is wrong: JavaScript allows
that escape in strings but JSON doesn't.  I think the easiest is to just
replace "JSON" with "JavaScript" in that sentence to make it right.  The
paragraph also already says "embedded string literals follow JavaScript/
ECMAScript conventions", so mentioning JSON seems unnecessary to me.

The last sentence also mentions backslash escapes \xNN and \u{N...} as
deviations from JSON when in fact those are valid escape sequences from
ECMA-262:https://262.ecma-international.org/#prod-HexEscapeSequence
So I think it makes sense to reword the entire backslash part of the
paragraph and remove references to JSON entirely.  The attached patch
does that and also formats the backslash escapes as a bulleted list for
readability.


I have committed this patch, and backpatched it, as a bug fix, because 
the existing description was wrong.  To keep the patch minimal for 
backpatching, I didn't do the conversion to a list.  I'm not sure I like 
that anyway, because it tends to draw more attention to that part over 
the surrounding parts, which didn't seem appropriate in this case.  But 
anyway, if you have any more non-bug-fix editing in this area, which 
would then target PG18, please send more patches.






Re: Q: Escapes in jsonpath Idents

2024-03-19 Thread David E. Wheeler
On Mar 17, 2024, at 20:09, Erik Wienhold  wrote:
> 
> On 2024-03-17 20:50 +0100, David E. Wheeler wrote:
>> On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:
>>> So I think it makes sense to reword the entire backslash part of the
>>> paragraph and remove references to JSON entirely.  The attached patch
>>> does that and also formats the backslash escapes as a bulleted list for
>>> readability.
>> 
>> Ah, it’s JavaScript format, not JSON! This does clarify things quite
>> nicely, thank you. Happy to add my review once it’s in a commit fest.
> 
> Thanks.  https://commitfest.postgresql.org/48/4899/

Applies cleanly, `make -C doc/src/sgml check` runs without error. Doc 
improvement welcome and much clearer than before.

> I had the same reasoning while writing my first reply but scrapped that
> part because I found it obvious:  That jsonpath is parsed before calling
> jsonb_path_exists() and therefore the parser has no context about any
> variables, which might not even be hardcoded but may result from a
> query.

Right, there’s a chicken/egg problem.

> Unfortunately, I don't have access to that part of the SQL spec.  So I
> don't know how the jsonpath grammar is specified.

Seems quite logical; I think it should be documented, but I’d also be 
interested to know what the 2016 and 2023 standards say, exactly.

> Also checked git log src/backend/utils/adt/jsonpath_scan.l for some
> insights but haven't found any yet.

Everybody’s taking shortcuts relative to the standard, AFAICT. For example, 
jsonpath_scan.l matches unqouted identifiers with these two regular expressions:

{other}+
\/\*
\\.

Plus the backslash escapes. {other} is defined as:

/* "other" means anything that's not special, blank, or '\' or '"' */
other [^\?\%\$\.\[\]\{\}\(\)\|\&\!\=\<\>\@\#\,\*:\-\+\/\\\" \t\n\r\f]

Which is wy more liberal than the ECMA standard[1], by my reading, but the 
MSDN[2] description is quite succinct (thanks for the links!):

> In JavaScript, identifiers are commonly made of alphanumeric characters, 
> underscores (_), and dollar signs ($). Identifiers are not allowed to start 
> with numbers. However, JavaScript identifiers are not only limited to ASCII — 
> many Unicode code points are allowed as well. Namely, any character in the 
> ID_Start category can start an identifier, while any character in the 
> ID_Continue category can appear after the first character.


ID_Start[3] and ID_Continue[4] point to the unicode standard codes lister, 
nether of which reference Emoji. Sure enough, in Safari:

> x = {"": true}
< {: true}
> x.
< SyntaxError: Invalid character '\ud83c’

But in Postgres jsonpath:

david=# select '$.'::jsonpath;
 jsonpath 
--
 $.""

If the MSDN references to ID_Start and ID_Continue are correct, then the 
Postgres path parser is being overly-liberal. Maybe that’s totally fine? Not 
sure what should be documented and what’s not worth it.

Aside: I’m only digging into these details because I’m busy porting the path 
parser, so trying to figure out where to be compatible and where not to. So far 
I’m rejecting '$' (but allowing '\$' and '\u0024') but taking advantage of the 
unicode support in Go to specifically validate against ID_Start and ID_Continue.

Best,

David

[1] https://262.ecma-international.org/#sec-identifier-names
[2] 
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers
[3] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Start%7D
[4] 
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BID_Continue%7D











Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread Erik Wienhold
On 2024-03-17 20:50 +0100, David E. Wheeler wrote:
> On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:
> > So I think it makes sense to reword the entire backslash part of the
> > paragraph and remove references to JSON entirely.  The attached patch
> > does that and also formats the backslash escapes as a bulleted list for
> > readability.
> 
> Ah, it’s JavaScript format, not JSON! This does clarify things quite
> nicely, thank you. Happy to add my review once it’s in a commit fest.

Thanks.  https://commitfest.postgresql.org/48/4899/

> > The first case ($.$foo) is in line with the restriction on member
> > accessors that you quoted first.
> 
> Huh, that’s now how I read it. Here it is again:
> 
> >> Member accessor that returns an object member with the specified
> >> key. If the key name matches some named variable starting with $ or
> >> does not meet the JavaScript rules for an identifier, it must be
> >> enclosed in double quotes to make it a string literal.
> 
> 
> Note that in my example `$foo` does not match a variable. I mean it
> looks like a variable, but none is used here. I guess it’s being
> conservative because it might be used in one of the functions, like
> jsonb_path_exists(), to which variables might be passed.

I had the same reasoning while writing my first reply but scrapped that
part because I found it obvious:  That jsonpath is parsed before calling
jsonb_path_exists() and therefore the parser has no context about any
variables, which might not even be hardcoded but may result from a
query.

> > The error message 'syntax error at or near "$oo" of jsonpath input' for
> > the second case ($.f$oo), however, looks as if the scanner identifies
> > '$oo' as a variable instead of contiuing the scan of identifier (f$oo)
> > for the member accessor.  Looks like a bug to me because a variable
> > doesn't even make sense in that place.
> 
> Right. Maybe the docs should be updated to say that a literal dollar
> sign isn’t supported in identifiers, unlike in JavaScript, except
> through escapes like this:

Unfortunately, I don't have access to that part of the SQL spec.  So I
don't know how the jsonpath grammar is specified.

I had a look into Oracle, MySQL, and SQLite docs to see what they
implement:

* Oracle requires the unquoted field names to match [A-Za-z][A-Za-z0-9]*
  (see "object steps").  It also supports variables.
  
https://docs.oracle.com/en/database/oracle/oracle-database/23/adjsn/json-path-expressions.html

* MySQL refers to ECMAScript identifiers but does not say anything about
  variables: https://dev.mysql.com/doc/refman/8.3/en/json.html#json-path-syntax

* SQLite skimps on details and does not document a grammar:
  https://sqlite.org/json1.html#path_arguments
  But it looks as if it strives for compatibility with MySQL and our dear
  Postgres: https://sqlite.org/src/doc/json-in-core/doc/json-enhancements.md

Also checked git log src/backend/utils/adt/jsonpath_scan.l for some
insights but haven't found any yet.

-- 
Erik




Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread David E. Wheeler
On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:

> Hi David,

Hey Erik. Thanks for the detailed reply and patch!

> So I think it makes sense to reword the entire backslash part of the
> paragraph and remove references to JSON entirely.  The attached patch
> does that and also formats the backslash escapes as a bulleted list for
> readability.

Ah, it’s JavaScript format, not JSON! This does clarify things quite nicely, 
thank you. Happy to add my review once it’s in a commit fest.

> The first case ($.$foo) is in line with the restriction on member
> accessors that you quoted first.

Huh, that’s now how I read it. Here it is again:

>> Member accessor that returns an object member with the specified
>> key. If the key name matches some named variable starting with $ or
>> does not meet the JavaScript rules for an identifier, it must be
>> enclosed in double quotes to make it a string literal.


Note that in my example `$foo` does not match a variable. I mean it looks like 
a variable, but none is used here. I guess it’s being conservative because it 
might be used in one of the functions, like jsonb_path_exists(), to which 
variables might be passed.

> The error message 'syntax error at or near "$oo" of jsonpath input' for
> the second case ($.f$oo), however, looks as if the scanner identifies
> '$oo' as a variable instead of contiuing the scan of identifier (f$oo)
> for the member accessor.  Looks like a bug to me because a variable
> doesn't even make sense in that place.

Right. Maybe the docs should be updated to say that a literal dollar sign isn’t 
supported in identifiers, unlike in JavaScript, except through escapes like 
this:

> What works though, besides double quoting, is escaping the dollar sign:
> 
> regress=# select '$.\u0024foo'::jsonpath;
> jsonpath
> --
> $."$foo"
> (1 row)
> 
> And we've come full circle :)



Best,

David






Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread Erik Wienhold
Hi David,

On 2024-03-16 19:39 +0100, David E. Wheeler wrote:
> The jsonpath doc[1] has an excellent description of the format of
> strings, but for unquoted path keys, it simply says:
> 
> > Member accessor that returns an object member with the specified
> > key. If the key name matches some named variable starting with $ or
> > does not meet the JavaScript rules for an identifier, it must be
> > enclosed in double quotes to make it a string literal.
> 
> I went looking for the JavaScript rules for an identifier and found
> this in the MDN docs[2]:
> 
> > In JavaScript, identifiers can contain Unicode letters, $, _, and
> > digits (0-9), but may not start with a digit. An identifier differs
> > from a string in that a string is data, while an identifier is part
> > of the code. In JavaScript, there is no way to convert identifiers
> > to strings, but sometimes it is possible to parse strings into
> > identifiers.
> 
> 
> However, the Postgres parsing of jsonpath keys appears to follow the
> same rules as strings, allowing backslash escapes:
> 
> david=# select '$.fo\u00f8 == $x'::jsonpath;
>  jsonpath   ---
>  ($."foø" == $"x")
> 
> This would seem to contradict the documentation. Is this behavior
> required by the SQL standard? Do the docs need updating? Or should the
> code actually follow the JSON identifier behavior?

That quoted MDN page does not give the whole picture.  ECMAScript and JS
do allow Unicode escape sequences in identifier names:

https://262.ecma-international.org/#sec-identifier-names
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers

> PS: Those excellent docs on strings mentions support for \v, but the
> grammar in the right nav of https://www.json.org/json-en.html does
> not. Another bonus feature?

You refer to that sentence: "Other special backslash sequences include
those recognized in JSON strings: \b, \f, \n, \r, \t, \v for various
ASCII control characters, and \u for a Unicode character identified
by its 4-hex-digit code point."

Mentioning JSON and \v in the same sentence is wrong: JavaScript allows
that escape in strings but JSON doesn't.  I think the easiest is to just
replace "JSON" with "JavaScript" in that sentence to make it right.  The
paragraph also already says "embedded string literals follow JavaScript/
ECMAScript conventions", so mentioning JSON seems unnecessary to me.

The last sentence also mentions backslash escapes \xNN and \u{N...} as
deviations from JSON when in fact those are valid escape sequences from
ECMA-262: https://262.ecma-international.org/#prod-HexEscapeSequence
So I think it makes sense to reword the entire backslash part of the
paragraph and remove references to JSON entirely.  The attached patch
does that and also formats the backslash escapes as a bulleted list for
readability.

> [1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH
> [2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier

On 2024-03-16 21:33 +0100, David E. Wheeler wrote:
> On Mar 16, 2024, at 14:39, David E. Wheeler 
> wrote:
> 
> > I went looking for the JavaScript rules for an identifier and found
> > this in the MDN docs[2]:
> > 
> >> In JavaScript, identifiers can contain Unicode letters, $, _, and
> >> digits (0-9), but may not start with a digit. An identifier differs
> >> from a string in that a string is data, while an identifier is part
> >> of the code. In JavaScript, there is no way to convert identifiers
> >> to strings, but sometimes it is possible to parse strings into
> >> identifiers.
> 
> Coda: Dollar signs don’t work at all outside double-quoted string
> identifiers:
> 
> david=# select '$.$foo'::jsonpath;
> ERROR:  syntax error at or near "$foo" of jsonpath input
> LINE 1: select '$.$foo'::jsonpath;
>^
> 
> david=# select '$.f$oo'::jsonpath;
> ERROR:  syntax error at or near "$oo" of jsonpath input
> LINE 1: select '$.f$oo'::jsonpath;
>^
> 
> david=# select '$."$foo"'::jsonpath;
>  jsonpath 
> --
>  $."$foo"
> 
> This, too, contradicts the MDM definition an identifier (and some
> quick browser tests).

The first case ($.$foo) is in line with the restriction on member
accessors that you quoted first.

The error message 'syntax error at or near "$oo" of jsonpath input' for
the second case ($.f$oo), however, looks as if the scanner identifies
'$oo' as a variable instead of contiuing the scan of identifier (f$oo)
for the member accessor.  Looks like a bug to me because a variable
doesn't even make sense in that place.

What works though, besides double quoting, is escaping the dollar sign:

regress=# select '$.\u0024foo'::jsonpath;
 jsonpath
--
 $."$foo"
(1 row)

And we've come full circle :)

-- 
Erik
>From a2bade71867aecbea90c7c03f0295cecca0c215d Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Sun, 17 Mar 2024 19:28:07 +0100
Subject: [PATCH v1] Simplify docs on 

Re: Q: Escapes in jsonpath Idents

2024-03-16 Thread David E. Wheeler
On Mar 16, 2024, at 14:39, David E. Wheeler  wrote:

> I went looking for the JavaScript rules for an identifier and found this in 
> the MDN docs[2]:
> 
>> In JavaScript, identifiers can contain Unicode letters, $, _, and digits 
>> (0-9), but may not start with a digit. An identifier differs from a string 
>> in that a string is data, while an identifier is part of the code. In 
>> JavaScript, there is no way to convert identifiers to strings, but sometimes 
>> it is possible to parse strings into identifiers.

Coda: Dollar signs don’t work at all outside double-quoted string identifiers:

david=# select '$.$foo'::jsonpath;
ERROR:  syntax error at or near "$foo" of jsonpath input
LINE 1: select '$.$foo'::jsonpath;
   ^

david=# select '$.f$oo'::jsonpath;
ERROR:  syntax error at or near "$oo" of jsonpath input
LINE 1: select '$.f$oo'::jsonpath;
   ^

david=# select '$."$foo"'::jsonpath;
 jsonpath 
--
 $."$foo"

This, too, contradicts the MDM definition an identifier (and some quick browser 
tests).

Best,

David