Re: Inconsistent text markup handling when double-nesting markers

2023-10-12 Thread Ihor Radchenko
Max Nikulin  writes:

> By the way, is it explicitly specified that, within an element, it is the
> top-down strategy that must be used to recognize objects?

https://orgmode.org/worg/org-syntax.html has it, I think.

-- 
Ihor Radchenko // yantar92,
Org mode contributor



Re: Inconsistent text markup handling when double-nesting markers

2023-10-12 Thread Max Nikulin

On 11/10/2023 19:26, Ihor Radchenko wrote:
> Max Nikulin writes:
>
>> P.S. Juan Manuel at certain moment discovered that pandoc allows nesting
>> for *b1 *b2* b3*.
>
> Which is a bug in pandoc.
>
> I think we discussed this topic a number of times in the past - our
> markup is a compromise between simplicity for users and simplicity of
> the parser. This works in many simple cases, but edge cases become
> problematic.

I have no intention of starting a discussion about changing the patterns
that recognize the beginning and end of objects, or about extending the
syntax.

My guess is that pandoc may use a bottom-up rather than a top-down
approach. I admit my opinion may be biased by reading complaints about
unexpected behavior of the current implementation. Besides its advantages,
the pandoc parser perhaps has downsides as well. I would not be surprised
if a bottom-up parser is unbearable to write without some tool that
generates code from the provided rules.

By the way, is it explicitly specified that, within an element, it is the
top-down strategy that must be used to recognize objects?




Re: Inconsistent text markup handling when double-nesting markers

2023-10-11 Thread Tom Alexander
> Fixed, on main.

Thanks!

--
Tom Alexander
pgp: https://fizz.buzz/pgp.asc



Re: Inconsistent text markup handling when double-nesting markers

2023-10-11 Thread Ihor Radchenko
Max Nikulin  writes:

>> No, **bold** it is not a bug. The parser is recursive with inner markup
>> not "seeing" its parent. So, we first parse the outer bold and then
>> continue parsing the contents separately, as *bold*.
>
> I just find the following rather confusing:
>
> (org-export-string-as "**bold**" 'html t)
> "\nbold\n"
> (org-export-string-as "**inner* outer*" 'html t)
> "\n*inner outer*\n"
> (org-export-string-as "*outer *inner**" 'html t)
> "\nouter inner\n"
> (org-export-string-as "*begin *inner* end*" 'html t)
> "\nbegin *inner end*\n"

Maybe. It is indeed one of the edge cases. But it follows the parser
logic: (1) the first matching markup is parsed; (2) its contents are then
parsed recursively, in isolation.
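
The two rules can be illustrated with a toy model. The Python sketch below is only an approximation written for this thread, not the org-element code: the names `parse`, `PRE_CHARS`, `POST_CHARS` and `MARKERS` are made up, only `*` and `/` are handled, and the PRE and POST character sets are an assumption transcribed from the syntax document.

```python
import re

# Assumed PRE/POST character sets, transcribed from the syntax document.
PRE_CHARS = set(" \t\n-({'\"")
POST_CHARS = " \t\n-.,;:!?'\")}\\["
MARKERS = {"*": "bold", "/": "italic"}  # toy subset of the real markers


def parse(text):
    """Return a list of plain strings and (type, children) tuples."""
    out, buf, i = [], "", 0
    while i < len(text):
        ch = text[i]
        # The start of an isolated string counts as a beginning of line.
        pre_ok = i == 0 or text[i - 1] in PRE_CHARS
        body_ok = i + 1 < len(text) and not text[i + 1].isspace()
        if ch in MARKERS and pre_ok and body_ok:
            # Rule 1: the first "non-space, marker, POST" sequence closes.
            closing = re.compile(
                r"\S(" + re.escape(ch) + r")(?:[" + re.escape(POST_CHARS) + r"]|$)"
            )
            m = closing.search(text, i + 1)
            if m:
                if buf:
                    out.append(buf)
                    buf = ""
                # Rule 2: the contents are parsed on their own.
                out.append((MARKERS[ch], parse(text[i + 1:m.start(1)])))
                i = m.end(1)
                continue
        buf += ch
        i += 1
    if buf:
        out.append(buf)
    return out


for sample in ("**bold**", "**inner* outer*",
               "*outer *inner**", "*begin *inner* end*"):
    print(repr(sample), "->", parse(sample))
```

Under these assumptions the first and third samples come out as bold nested in bold, while the second and fourth yield a single bold ("*inner" and "begin *inner" respectively) followed by plain text, because the first valid closing marker wins.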

>> Be it another way, /*bold italic*/ would also not be allowed as
>> we demand bol, whitespace, -, (, {, ', or " before the markup:
>> https://orgmode.org/worg/org-syntax.html#Emphasis_Markers
>
> Certainly /*b*/ should work, but nested bold was a surprise for me. I 
> believed that nesting is strictly prohibited. The case of underscores is 
> even more tricky due to ambiguity of underline and subscript.

Nesting is not deliberately prohibited. It is just a consequence of how
the parser works that nesting the same construct is almost impossible,
except in certain edge cases like **b**.

> P.S. Juan Manuel at certain moment discovered that pandoc allows nesting 
> for *b1 *b2* b3*.

Which is a bug in pandoc.

I think we discussed this topic a number of times in the past - our
markup is a compromise between simplicity for users and simplicity of
the parser. This works in many simple cases, but edge cases become
problematic.

Workarounds have been discussed as well. For example, creole markup and
generic inline markup constructs (your idea with direct AST and the idea
with inline special blocks).




Re: Inconsistent text markup handling when double-nesting markers

2023-10-11 Thread Max Nikulin

On 11/10/2023 16:15, Ihor Radchenko wrote:
> Max Nikulin  writes:
>
>> Isn't nested bold for "**bold**" a bug? Generally it is not allowed and
>>
>>    *b1 *b2* b3*
>>
>> is parsed as bold only for "b1 *b2".
>
> No, **bold** it is not a bug. The parser is recursive with inner markup
> not "seeing" its parent. So, we first parse the outer bold and then
> continue parsing the contents separately, as *bold*.

I just find the following rather confusing:

(org-export-string-as "**bold**" 'html t)
"\nbold\n"
(org-export-string-as "**inner* outer*" 'html t)
"\n*inner outer*\n"
(org-export-string-as "*outer *inner**" 'html t)
"\nouter inner\n"
(org-export-string-as "*begin *inner* end*" 'html t)
"\nbegin *inner end*\n"

> Be it another way, /*bold italic*/ would also not be allowed as
> we demand bol, whitespace, -, (, {, ', or " before the markup:
> https://orgmode.org/worg/org-syntax.html#Emphasis_Markers

Certainly /*b*/ should work, but nested bold was a surprise for me. I
believed that nesting is strictly prohibited. The case of underscores is
even more tricky due to ambiguity of underline and subscript.

P.S. Juan Manuel at certain moment discovered that pandoc allows nesting
for *b1 *b2* b3*.






Re: Inconsistent text markup handling when double-nesting markers

2023-10-11 Thread Ihor Radchenko
Max Nikulin  writes:

> Isn't nested bold for "**bold**" a bug? Generally it is not allowed and
>
>   *b1 *b2* b3*
>
> is parsed as bold only for "b1 *b2".

No, **bold** is not a bug. The parser is recursive with inner markup
not "seeing" its parent. So, we first parse the outer bold and then
continue parsing the contents separately, as *bold*.

Were it otherwise, /*bold italic*/ would also not be allowed as
we demand bol, whitespace, -, (, {, ', or " before the markup:
https://orgmode.org/worg/org-syntax.html#Emphasis_Markers
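
To make the PRE requirement concrete, here is a small Python sketch. It is an illustration only: `can_open` and `PRE_CHARS` are hypothetical names, and the character set is an assumption taken from the list above (beginning of line is modelled as "position 0 or after a newline").

```python
# Assumed PRE set: whitespace, -, (, {, ', " (bol is handled separately).
PRE_CHARS = set(" \t\n-({'\"")


def can_open(text, i):
    """True if the marker at text[i] has a valid PRE and a non-space body."""
    has_pre = i == 0 or text[i - 1] in PRE_CHARS
    has_body = i + 1 < len(text) and not text[i + 1].isspace()
    return has_pre and has_body


# Seen inside the whole string, the "*" is preceded by "/", which is not PRE.
print(can_open("/*bold italic*/", 1))
# Seen inside the isolated contents, the same "*" sits at the beginning.
print(can_open("*bold italic*", 0))
```

With these assumptions the first call reports False and the second True, which is why the contents have to be parsed without "seeing" the parent for /*bold italic*/ to work.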




Re: Inconsistent text markup handling when double-nesting markers

2023-10-10 Thread Max Nikulin

On 10/10/2023 19:07, Ihor Radchenko wrote:
> "Tom Alexander" writes:
>
>> I used the following test document:
>> ```
>> __foo__
>>
>> **foo**
>> ```
>
> Fixed, on main.
> https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=fe23bec60

Isn't nested bold for "**bold**" a bug? Generally it is not allowed and

 *b1 *b2* b3*

is parsed as bold only for "b1 *b2".





Re: Inconsistent text markup handling when double-nesting markers

2023-10-10 Thread Ihor Radchenko
"Tom Alexander"  writes:

> I used the following test document:
> ```
> __foo__
>
> **foo**
> ```
>
> I'd expect the two to behave the same but the first one parses as:
> ```
> (paragraph
>   "_"
>   (subscript "foo")
>   "__"
>   )
> ```

Fixed, on main.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=fe23bec60




Inconsistent text markup handling when double-nesting markers

2023-10-09 Thread Tom Alexander
I used the following test document:
```
__foo__

**foo**
```

I'd expect the two to behave the same but the first one parses as:
```
(paragraph
  "_"
  (subscript "foo")
  "__"
  )
```

Whereas the second parses as:
```
(paragraph
  (bold
    (bold
      "foo"
      )
    )
  )
```

This pattern happens in worg at [2]

Looking at the description for text markup in the syntax document[1], I don't 
see any reason the first wouldn't be parsed as an underline:

1. PRE: valid because it is the beginning of a line
2. MARKER: valid underscore
3. CONTENTS: valid. Series of objects from standard set includes both subscript 
and text markup, so regardless of how we parse the interior, it's valid. Also 
cannot begin or end with whitespace but there is no whitespace in the CONTENTS.
4. MARKER: valid underscore
5. POST: Only valid if we extend the underline to the 2nd underscore so it ends 
at the end of the line. But the 2nd line shows us that having copies of the 
marker inside the CONTENTS is fine so I see two possible expected parses of the 
CONTENTS:
4a. (underline "foo")
4b. ((subscript "foo") (plain-text "_"))

I also ran the following test document to further prove that having copies of 
the marker inside the CONTENTS is fine:
```
*foo*bar*
```
which parses as (bold "foo*bar")

So the only way the top line would fail to parse as an underline is if it 
matched the first closing underscore as closing the underline, but that would 
be invalid because underscore is not a valid POST character and invalid copies 
of the closing marker are ignored as proven by both "**foo**" and "*foo*bar*".
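
The argument can be checked mechanically. The Python sketch below is only an illustration under stated assumptions: `closing_candidates` and `POST_CHARS` are hypothetical names, the POST set is my transcription of the syntax document[1], and the line is assumed to start with the opening marker.

```
POST_CHARS = set(" \t\n-.,;:!?'\")}\\[")  # assumed POST set; eol handled below


def closing_candidates(line, marker):
    """List (index, post_ok) for each later copy of the marker.

    The line is assumed to open with `marker`; a candidate must follow a
    non-whitespace character, and CONTENTS needs at least one character.
    """
    results = []
    for i in range(2, len(line)):
        if line[i] != marker or line[i - 1].isspace():
            continue
        at_end = i + 1 == len(line)
        results.append((i, at_end or line[i + 1] in POST_CHARS))
    return results


for line, marker in (("__foo__", "_"), ("**foo**", "*"), ("*foo*bar*", "*")):
    print(line, closing_candidates(line, marker))
```

In all three lines only the last copy of the marker passes the POST check, so under these rules "__foo__" and "**foo**" have the same single valid closing position.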


[1] https://orgmode.org/worg/org-syntax.html#Emphasis_Markers
[2] https://git.sr.ht/~bzg/worg/tree/ba6cda890f200d428a5d68e819eef15b5306055f/org-contrib/babel/intro.org#L117

--
Tom Alexander
pgp: https://fizz.buzz/pgp.asc