Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)

2023-07-18 Thread Ihor Radchenko
Tom Gillespie writes:

>> We might probably generalize to
>> PRE  = Zs Zl Pc Pd Ps Pi ' "
>> POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [
>
> If this works I think it is reasonable. We might want to
> specify what to do in cases where an org implementation
> might not fully support unicode,

Just fall back to ASCII subset? If the implementation does not support
unicode, it probably cannot properly work with UTF-encoded documents
anyway.

> ...and might want to do a
> review of related issues in syntax with respect to ascii
> vs unicode, because iirc there is some ambiguity in
> the current syntax doc.
> For example, I'm pretty sure that I'm mixing and matching
> unicode and ascii whitespace in the tokenizer I have in Racket.

Feel free to open new bug reports about such ambiguities.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at 



Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)

2023-07-17 Thread Tom Gillespie
> We might probably generalize to
> PRE  = Zs Zl Pc Pd Ps Pi ' "
> POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [

If this works I think it is reasonable. We might want to
specify what to do in cases where an org implementation
might not fully support unicode, and might want to do a
review of related issues in syntax with respect to ascii
vs unicode, because iirc there is some ambiguity in
the current syntax doc.

For example, I'm pretty sure that I'm mixing and matching
unicode and ascii whitespace in the tokenizer I have in Racket.

> Though we need to take care excluding zero-width spaces.

Ya, I removed a comment to this effect in the paragraph about
the usual alternate solution.

> Emacs does not support them though (yet?).

Racket has full support for the latest unicode standards iirc,
so I will see if I can leverage that support for testing in laundry.

> In the end, it is the current ASCII limitation plus a partially arbitrary
> choice of boundaries that keeps some users confused (we get bug
> reports about confusing markup from time to time).

Ya, it would be good to try to generalize the affordance if possible since
users of text in non-ascii languages have certain valid expectations. Hopefully,
the unicode consortium has managed to cover the categories we need.



Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)

2023-07-17 Thread Ihor Radchenko
Tom Gillespie writes:

> The way I have implemented this is by maintaining an explicit list of
> characters that are safe for pre markup and another for post markup.
>
> It is not possible to use unicode punctuation for this because there
> are a variety of punctuation marks that cannot appear in that position
> and be considered markup, those include @, #, % to name just a few.

Not that bad.
Unicode standard defines the following categories (I listed those that
might be of use):

Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph

We currently use the following:
PRE  =   - ( ' " {
POST =  - . ; : ! ? ' " ) } \ [

At least ( and { both have the general category Ps, for example:
(get-char-code-property ?{ 'general-category) ;=> Ps (punctuation, open)

We might probably generalize to
PRE  = Zs Zl Pc Pd Ps Pi ' "
POST = Zs Zl Pc Pd Pe Pf . ; : ! ? ' " \ [

Though we need to take care excluding zero-width spaces.
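
To make the proposal above concrete, here is a minimal sketch in Python (not Emacs Lisp) that tests a single character against the generalized PRE and POST sets using the standard `unicodedata` module. The names `is_pre`, `is_post`, and `EXCLUDED` are hypothetical and only serve the illustration. It is not how Org implements emphasis matching.

```python
import unicodedata

# General categories taken from the proposed PRE/POST definitions.
PRE_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Ps", "Pi"}
POST_CATEGORIES = {"Zs", "Zl", "Pc", "Pd", "Pe", "Pf"}

# Explicit ASCII characters listed alongside the categories.
PRE_EXTRA = set("'\"")
POST_EXTRA = set(".;:!?'\"\\[")

# Assumption: zero-width spaces are excluded by an explicit list.
EXCLUDED = {"\u200b"}  # ZERO WIDTH SPACE


def is_pre(ch):
    """Return True if CH may directly precede an opening marker."""
    if ch in EXCLUDED:
        return False
    return ch in PRE_EXTRA or unicodedata.category(ch) in PRE_CATEGORIES


def is_post(ch):
    """Return True if CH may directly follow a closing marker."""
    if ch in EXCLUDED:
        return False
    return ch in POST_EXTRA or unicodedata.category(ch) in POST_CATEGORIES


if __name__ == "__main__":
    for ch in "({\u300c_@#%\u200b":
        print(repr(ch), unicodedata.category(ch), is_pre(ch))
```

With these sets, `is_pre("_")` is true because the underscore has category Pc, while `@`, `#`, and `%` (all Po) stay excluded. In the Unicode data shipped with Python, U+200B has category Cf rather than Zs, so the explicit exclusion here is only a safeguard.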

I can find https://www.unicode.org/review/pr-23.html that defines
punctuation terminals like .;:!?
It looks like it is adopted, via special properties:
https://www.unicode.org/reports/tr44/#STerm and
https://www.unicode.org/reports/tr44/#Terminal_Punctuation

Emacs does not support them though (yet?).

> Therefore, if we want to do this we commit to extending and then
> maintaining the lists of valid pre and post markup delimiters as
> special cases.

We certainly do not want to do this. Maintaining such lists is out of
scope for Org when the Unicode data can be used instead.

> Note also this could produce changes from current behavior because
> things that previously tokenized as a series of words connected by
> e.g. underscores could become markup.

Indeed. And we should study the feedback.
However, most scenarios that will change involve non-ASCII Unicode
characters next to markup. The odds are low that users put such
characters at a markup boundary and _also expect markup to be ignored_. In
the end, it is the current ASCII limitation plus a partially arbitrary
choice of boundaries that keeps some users confused (we get bug
reports about confusing markup from time to time).

Of course, we can, as usual, provide a linter to catch such scenarios
and warn about the change in ORG-NEWS.

I do believe that better Unicode support will benefit many Org users
that use non-Latin scripts. 




Re: Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)

2023-07-17 Thread Tom Gillespie
Hi Ihor,
   Thank you for looping me in. Best,
Tom

The way I have implemented this is by maintaining an explicit list of
characters that are safe for pre markup and another for post markup.

It is not possible to use unicode punctuation for this because there
are a variety of punctuation marks that cannot appear in that position
and be considered markup, those include @, #, % to name just a few.

Therefore, if we want to do this we commit to extending and then
maintaining the lists of valid pre and post markup delimiters as
special cases.

Note also this could produce changes from current behavior because
things that previously tokenized as a series of words connected by
e.g. underscores could become markup.

The alternative would be (as usual in these cases) for the user to
add a zero width space or something like that between the end of the
markup marker and the symbol they want to follow the markup. This
solution is (trivially) backward compatible, and works for all chars
regardless of whether org-mode has blessed them as sanctioned marks.

My inclination would be not to make this change because there are a
potentially infinite number of future "left right neutral" marks
that we would have to maintain and would occasionally have to field
requests from users to add them, and those solutions would not work
with older versions of org.



Org markup and non-ASCII punctuation (was: org parser and priorities of inline elements)

2023-07-17 Thread Ihor Radchenko
Max Nikulin writes:

> On 21/11/2021 16:28, Ihor Radchenko wrote:
>> 
>> Also, is there any reason why we are not simply using punctuation
>> character class instead of listing punctuation chars explicitly (and
>> only for English)? What about "_你叫什么名字_?"
>
> It seems punctuation character class is too broad. E.g.
>  ¿ INVERTED QUESTION MARK
> normally appears before words, while "?" is usually after them. I do not 
> see anything special in
>  (category-set-mnemonics (char-category-set ?¿))
> that may help to discriminate such cases.

The last resort is define-category where we can manage exceptions.
But I think that even without distinguishing ?¿, we can improve the
situation for CJK users a lot.

We can probably split character categories into "left", "right", and
"neutral" with "(" being "left" example, ")" being "right" example, and
" " being "neutral" example.
We start from using the information we can extract from Unicode data and
modify it as necessary.

Then, emphasis will be defined as PRE MARKER ... MARKER POST with
PRE = left+neutral category
POST = right+neutral category
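
The left/right/neutral idea can be sketched as follows, in Python rather than Emacs Lisp. This is only an illustration under stated assumptions: the mapping from general categories to sides, the `OVERRIDES` table (standing in for the exceptions mentioned above), and the function names `side` and `find_emphasis` are all hypothetical. The matcher also ignores Org's other emphasis rules, such as contents not starting or ending with whitespace.

```python
import unicodedata

# Assumed starting point derived from Unicode general categories.
BASE = {
    "Ps": "left", "Pi": "left",
    "Pe": "right", "Pf": "right",
    "Zs": "neutral", "Zl": "neutral", "Pc": "neutral", "Pd": "neutral",
}

# Assumed manual exceptions, modified "as necessary".
OVERRIDES = {"\u00bf": "left", "\u00a1": "left", "'": "neutral", '"': "neutral"}
for _ch in ".;:!?":
    OVERRIDES[_ch] = "right"


def side(ch):
    """Return 'left', 'right', 'neutral', or None for other characters."""
    if ch in OVERRIDES:
        return OVERRIDES[ch]
    return BASE.get(unicodedata.category(ch))


def _allowed(text, index, sides):
    # The start and end of the text count as valid boundaries.
    if index < 0 or index >= len(text):
        return True
    return side(text[index]) in sides


def find_emphasis(text, marker):
    """Return contents of PRE MARKER ... MARKER POST spans in TEXT."""
    spans = []
    i = 0
    while i < len(text):
        if text[i] == marker and _allowed(text, i - 1, ("left", "neutral")):
            j = text.find(marker, i + 2)  # require non-empty contents
            while j != -1:
                if _allowed(text, j + 1, ("right", "neutral")):
                    spans.append(text[i + 1:j])
                    i = j
                    break
                j = text.find(marker, j + 1)
        i += 1
    return spans


if __name__ == "__main__":
    print(find_emphasis("_\u4f60\u53eb\u4ec0\u4e48\u540d\u5b57_?", "_"))
    print(find_emphasis("snake_case_name", "_"))
```

In this sketch the inverted question mark is handled purely through the override table, which mirrors the "last resort" of managing exceptions by hand.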




Re: org parser and priorities of inline elements

2021-11-27 Thread Nicolas Goaziou
Hello,

Max Nikulin writes:

> I can not estimate efforts necessary to implement priorities of
> objects (verbatim - link - emphasis) in org-elements parser since
> I have not looked into its code. Comparing the following snippets,
> I might naively expect some kind of backtracking:
>
> - A /b *c +d e+ f*g/ h
> - A /b *c +df* e+h
>
> I admit that I can be wrong and "first wins" approach handles buffer
> of incomplete parsed entities in a different way.

I don't see any incentive to change the order objects are parsed, once
you know how Org does it. This is just a red herring. What is useful,
however, is to fontify them the way Org sees them.

Regards,
-- 
Nicolas Goaziou



org parser and priorities of inline elements

2021-11-27 Thread Max Nikulin

On 21/11/2021 16:28, Ihor Radchenko wrote:
>
> Also, is there any reason why we are not simply using punctuation
> character class instead of listing punctuation chars explicitly (and
> only for English)? What about "_你叫什么名字_?"

It seems the punctuation character class is too broad. E.g.
 ¿ INVERTED QUESTION MARK
normally appears before words, while "?" usually appears after them. I do
not see anything special in
 (category-set-mnemonics (char-category-set ?¿))
that may help to discriminate such cases.

An example that confuses fontification but not the parser:
: false [[http://te.st/dir?b-=&a=-][verbatim]] fontification
This is a simplified example; the original report is:
Chris Hunt. Bug: Tildes in URL impact visible link text
Sun, 27 Dec 2020 11:44:07 -0500.
https://list.orgmode.org/CAH+Wm4-_XHUZKFTf=ztbfncpvqwkbeoegs8epym+8spmu8l...@mail.gmail.com/

Nicolas Goaziou. Thu, 18 Nov 2021 13:35:19 +0100.
https://list.orgmode.org/87y25l8wvs@nicolasgoaziou.fr
> Ihor Radchenko writes:
>
>> My intuition says that the current parser behaviour is not correct. It
>> would make more sense to prioritise link over italics. However, it would
>> require a major change in the parser - instead of a single pass, the
>> parser may parse different types of objects sequentially. The emphasis
>> objects should come last avoiding the markers to have different parents.
>
> I disagree. Priority should be given to the first object being started.
> This is, IMO, the only sane way to handle syntax.


This expectation does not come only from TeX, which changes the category
of characters in the argument of verbatim commands. In Markdown, links and
code have higher priority than emphasis as well:


echo 'A _b `c_ d` e_ f' | pandoc -f markdown -t html -
<p>A <em>b <code>c_ d</code> e</em> f</p>

Org:
A _b =c_ d= e_ f
export result (it is more concise and easier to read than output of 
`org-element-parse-secondary-string'):


A b =c d= e_ f


Link in markdown:

echo 'A _b c <https://orgmode.org/index.htm_?k=v> d e_ f' \
 | pandoc -f markdown -t html -
<p>A <em>b c <a href="https://orgmode.org/index.htm_?k=v"
class="uri">https://orgmode.org/index.htm_?k=v</a> d e</em> f</p>


Org:

A b /c https://orgmode.org/index.htm?k=v> 
d/ e_ f



I cannot estimate the effort necessary to implement priorities of objects
(verbatim - link - emphasis) in the org-element parser since I have not
looked into its code. Comparing the following snippets, I might naively
expect some kind of backtracking:


- A /b *c +d e+ f*g/ h
- A /b *c +df* e+h

I admit that I may be wrong and that the "first wins" approach handles the
buffer of incompletely parsed entities in a different way.


P.S. In reStructuredText simple nesting of inline markup is not allowed;
maybe it is possible to work around that with substitutions.