Re: [O] [RFC] Fixing link encoding once and for all

2019-03-05 Thread Robert Pluim
Neil Jerram  writes:

> Thanks for explaining that.  It's not mentioned in the manual though
> (https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html);
> are you sure that it's supported in Emacs regexps?
>

Itʼs described in the next node:



Robert



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-05 Thread Neil Jerram
Hi Nicolas,

On Tue, 5 Mar 2019 at 00:23, Nicolas Goaziou  wrote:
[...]
> So, the new challenger is:
>
> 
> "\\[\\[\\(\\(?:.\\|\n\\)*?[^\\]\\(\\)*\\)\\]\\(?:\\[\\(\\(?:.\\|\n\\)+?\\)\\]\\)?\\]"
>
> Beautiful.
>
> The commented rx equivalent would be:
>
> (seq "["
>  ;; URI part: match group 1.
>  "["
>  (group
>   (*? anything)
>   ;; Allow an even number of backslashes before the closing bracket.
>   (not (any "\\"))
>   (zero-or-more (group "\\\\")))
>  "]"
>  ;; Description (optional): match group 2.
>  (opt "[" (group (+? anything)) "]")
>  "]")
>
> > \(          # begin group 3
> > ?           # don't understand
> > :\[         # literal :[
>
> [...]
>
> > but there's at least a ? that I don't understand, and I'm afraid I'm
> > not seeing how it's useful.
>
> \(?: ... \) is a shy group.

Thanks for explaining that.  It's not mentioned in the manual though
(https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html);
are you sure that it's supported in Emacs regexps?

> > If you think it works, I'm happy to defer to your judgement on that!
> > Although I suggested the idea, I don't know Org nearly well enough to
> > be sure that I haven't missed problems;
>
> We are solving the problem with a regexp. What bad things could happen? ;)

Well hopefully the fallout is limited to destroying all of the text in
one Org buffer. :-)

More seriously, though, I don't understand when and how the regexp is
used.  Presumably you loop through the buffer looking for matches, but
what do you do after each match?

Regards,
Neil



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-04 Thread Nicolas Goaziou
Hello,

Neil Jerram  writes:

> On Fri, 1 Mar 2019 at 08:14, Nicolas Goaziou  wrote:

>> The regexp for bracket links could be, in its simple (!) form:
>>
>>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]
>
> [then a bit later]
>> Small update, in its string form now:
>>
>>   
>> "\\[\\[\\([^\000]*?[^\\]\\(\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"
>
> Is [^\000] the only (or best) way of saying "any character, including
> newlines"?

There is also "\(.\|\n\)", or "[[:ascii:][:nonascii:]]".

> Could there be actual NUL characters in the document?

Good question. I used [^\000] out of habit. You are right, "\(.\|\n\)"
is more robust.
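
The same distinction can be checked outside Emacs. The sketch below uses
Python's re module, where "." also excludes newline by default. It is an
analogy under that assumption, not a statement about Emacs regexps, and
matches_all is a hypothetical helper.

```python
import re

def matches_all(pattern: str, text: str) -> bool:
    """Return True when PATTERN matches TEXT from start to end."""
    return re.fullmatch(pattern, text) is not None

multi_line = "path\nwith newline"
with_nul = "a\x00b"

# "." stops at the newline, so ".*" cannot cover the whole text.
dot_ok = matches_all(r".*", multi_line)
# "Anything but NUL" covers newlines, but breaks on an actual NUL.
not_nul_ok = matches_all(r"[^\x00]*", multi_line)
not_nul_on_nul = matches_all(r"[^\x00]*", with_nul)
# The explicit alternation covers both cases.
alt_ok = matches_all(r"(?:.|\n)*", multi_line)
alt_on_nul = matches_all(r"(?:.|\n)*", with_nul)
```

Only the alternation accepts both inputs, which is the sense in which it is
the more robust choice.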

So, the new challenger is:


"\\[\\[\\(\\(?:.\\|\n\\)*?[^\\]\\(\\)*\\)\\]\\(?:\\[\\(\\(?:.\\|\n\\)+?\\)\\]\\)?\\]"

Beautiful.

The commented rx equivalent would be:

(seq "["
 ;; URI part: match group 1.
 "["
 (group
  (*? anything)
  ;; Allow an even number of backslashes before the closing bracket.
  (not (any "\\"))
  (zero-or-more (group "\\\\")))
 "]"
 ;; Description (optional): match group 2.
 (opt "[" (group (+? anything)) "]")
 "]")

> \(          # begin group 3
> ?           # don't understand
> :\[         # literal :[

[...]

> but there's at least a ? that I don't understand, and I'm afraid I'm
> not seeing how it's useful.

\(?: ... \) is a shy group.
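
For readers who want to experiment, the link regexp discussed above can be
approximated in Python, where a shy group is spelled (?: ... ). This is
a rough translation, not the regexp Org will use: Python's syntax differs
from Emacs's, the backslash pair is written as a shy group so that the
description is group 2, and match_link is a hypothetical helper.

```python
import re

BRACKET_LINK = re.compile(
    r"\[\["                         # opening "[["
    r"((?:.|\n)*?[^\\](?:\\\\)*)"   # group 1: path; ends with a non-backslash
                                    # followed by an even run of backslashes
    r"\]"                           # "]" closing the path
    r"(?:\[((?:.|\n)+?)\])?"        # optional "[description]": group 2
    r"\]"                           # final "]"
)

def match_link(text):
    """Return (path, description) for the first bracket link, or None."""
    m = BRACKET_LINK.search(text)
    return (m.group(1), m.group(2)) if m else None

# A doubled trailing backslash leaves the closing bracket unescaped.
windows_dir = match_link(r"[[c:\system32\\]]")
# A single backslash escapes the "][" construct, so it stays in the path.
escaped = match_link(r"[[a\][b]]")
# A single trailing backslash escapes the closing bracket: no link.
unterminated = match_link(r"[[c:\system32\]]")
```

The shy group only groups; it does not claim a match-group number, which is
why the description keeps index 2 in this translation.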

> If you think it works, I'm happy to defer to your judgement on that!
> Although I suggested the idea, I don't know Org nearly well enough to
> be sure that I haven't missed problems;

We are solving the problem with a regexp. What bad things could happen? ;)

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-04 Thread Neil Jerram
On Fri, 1 Mar 2019 at 08:14, Nicolas Goaziou  wrote:
>
> Hello,
>
> Neil Jerram  writes:
>
> > Do you mean Windows file names in existing Org files?  I.e. the
> > back-compatibility concern?
> >
> > If so, yes, I confess I didn't think at all about back-compatibility,
> > with my suggestion above.  So perhaps that rules my idea out.
> >
> > If we were starting from scratch, however,
> > - I believe it would technically be fine; i.e. it's a complete and
> > unambiguous encoding
> > - it might be considered awkward for Windows users to have to write
> > c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
> > know how big a concern that would be.
>
> Thinking a bit more about it, we don't need to escape /all/ square
> brackets, only "]]" and "][" constructs. Therefore, we don't need to
> escape every backslash either.

Agreed.

> The regexp for bracket links could be, in its simple (!) form:
>
>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

[then a bit later]
> Small update, in its string form now:
>
>   
> "\\[\\[\\([^\000]*?[^\\]\\(\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"

Is [^\000] the only (or best) way of saying "any character, including
newlines"?  Could there be actual NUL characters in the document?

More generally I'm not sure I'm fully understanding the regex.  I
_think_ it breaks down like this:

\[\[        # literal [[
\(          # begin group 1
[^\000]*?   # non-greedy any characters (0 or more)
[^\]        # something not a backslash
\(          # begin group 2
\\\\        # literal \\
\)*         # end group 2, and allow 0 or more of it
\)          # end group 1
\]          # literal ]
\(          # begin group 3
?           # don't understand
:\[         # literal :[
\(          # begin group 4
[^\000]+?   # non-greedy any characters (1 or more)
\)          # end group 4
\]          # literal ]
\)?         # end group 3, and allow 0 or 1 of it
\]          # literal ]

but there's at least a ? that I don't understand, and I'm afraid I'm
not seeing how it's useful.

> Most links would need no change.  I see one notable exception:
> directories in Windows:
>
>   [[c:\system32\\]] for "c:\system32\"

But I guess it would be unusual to write a trailing backslash like that.

> Some further notes:
>
> 1. Macros already use backslashes to escape commas in arguments, so it
>    is at least consistent with this part of Org.
>
> 2. The description part of the link, like most parts of Org, does not
>    use backslash escaping. If needed, we can implement an entity for
>    a square bracket.
>
> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look at URIs where every percent is followed only by 25, 5B, and 5D.
>
> WDYT?

If you think it works, I'm happy to defer to your judgement on that!
Although I suggested the idea, I don't know Org nearly well enough to
be sure that I haven't missed problems; but I guess that you would
know that.

Best wishes,
  Neil



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-03 Thread Nicolas Goaziou
Hello,

stardiviner  writes:

> Nicolas Goaziou  writes:

>> 3. There will be some backward compatibility issues. We can add
>>    a checker in Org Lint to catch most of those. For example, we could
>>    look at URIs where every percent is followed only by 25, 5B, and 5D.
>
> About this, I'm curious: would it be possible for this checker to search
> and interactively query-replace, running recursively over all Org files in
> a directory? When Org is updated, I hope my Org documents get updated too.

The linter is only effective on the current document, and does not offer
to change it.

Writing a function to replace such links would be great. It is not my
priority at the moment, tho. 

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-02 Thread stardiviner


Nicolas Goaziou  writes:

> Hello,
>
> Neil Jerram  writes:
>
>> Do you mean Windows file names in existing Org files?  I.e. the
>> back-compatibility concern?
>>
>> If so, yes, I confess I didn't think at all about back-compatibility,
>> with my suggestion above.  So perhaps that rules my idea out.
>>
>> If we were starting from scratch, however,
>> - I believe it would technically be fine; i.e. it's a complete and
>> unambiguous encoding
>> - it might be considered awkward for Windows users to have to write
>> c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
>> know how big a concern that would be.
>
> Thinking a bit more about it, we don't need to escape /all/ square
> brackets, only "]]" and "][" constructs. Therefore, we don't need to
> escape every backslash either.
>
> The regexp for bracket links could be, in its simple (!) form:
>
>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]
>
> Most links would need no change.  I see one notable exception:
> directories in Windows:
>
>   [[c:\system32\\]] for "c:\system32\"
>
> Some further notes:
>
> 1. Macros already use backslashes to escape commas in arguments, so it
>    is at least consistent with this part of Org.
>
> 2. The description part of the link, like most parts of Org, does not
>    use backslash escaping. If needed, we can implement an entity for
>    a square bracket.
>
> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look at URIs where every percent is followed only by 25, 5B, and 5D.
>

About this, I'm curious: would it be possible for this checker to search
and interactively query-replace, running recursively over all Org files in
a directory? When Org is updated, I hope my Org documents get updated too.

> WDYT?
>
> Regards,


-- 
[ stardiviner ]
   I try to make every word tell the meaning what I want to express.

   Blog: https://stardiviner.github.io/
   IRC(freenode): stardiviner, Matrix: stardiviner
   GPG: F09F650D7D674819892591401B5DF1C95AE89AC3
  



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Jens Lechtenboerger
On 2019-03-01, Nicolas Goaziou wrote:

> Jens Lechtenboerger  writes:
>
>> On 2019-03-01, Nicolas Goaziou wrote:
>>
>>> 3. There will be some backward compatibility issues. We can add
>>>    a checker in Org Lint to catch most of those. For example, we could
>>>    look at URIs where every percent is followed only by 25, 5B, and 5D.
>>
>> I do not understand this point.  What is special about URIs where
>> *only* those occur?  Might compatibility issues not arise if those
>> occur at all (while others such as %28 and %29 for parentheses might
>> occur without problems as well)?
>
> If a URI seems percent encoded, but only uses %25, %5B and %5D as escape
> combinations, there is a high chance that it is Org-encoded, and
> therefore uses a deprecated syntax. We could send a warning to the user
> in this case; they might want to clean the URI.
>
> OTOH, if there is %28, or %29, we are sure it isn't Org-encoded, and
> therefore, the percent-encoding was intended right from the start (like
> in your Wikipedia link).

Thanks for the clarification.  Makes sense.

Best wishes
Jens



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Nicolas Goaziou
Hello,

Jens Lechtenboerger  writes:

> On 2019-03-01, Nicolas Goaziou wrote:
>
>> 3. There will be some backward compatibility issues. We can add
>>    a checker in Org Lint to catch most of those. For example, we could
>>    look at URIs where every percent is followed only by 25, 5B, and 5D.
>
> I do not understand this point.  What is special about URIs where
> *only* those occur?  Might compatibility issues not arise if those
> occur at all (while others such as %28 and %29 for parentheses might
> occur without problems as well)?

If a URI seems percent encoded, but only uses %25, %5B and %5D as escape
combinations, there is a high chance that it is Org-encoded, and
therefore uses a deprecated syntax. We could send a warning to the user
in this case; they might want to clean the URI.

OTOH, if there is %28, or %29, we are sure it isn't Org-encoded, and
therefore, the percent-encoding was intended right from the start (like
in your Wikipedia link).
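
As a rough illustration of that heuristic, here is a Python sketch. It is
not the Org Lint checker, only a model of the rule as stated above, and
looks_org_encoded is a hypothetical name.

```python
import re

# Escape codes that Org-encoding is described as producing in this rule.
ORG_ONLY = {"25", "5B", "5D"}

def looks_org_encoded(uri: str) -> bool:
    """True when URI contains percent signs and each one is followed
    by 25, 5B or 5D (and nothing else)."""
    # Collect up to two characters after every percent sign.
    codes = re.findall(r"%(.{0,2})", uri)
    return bool(codes) and all(code.upper() in ORG_ONLY for code in codes)

suspect = looks_org_encoded(
    "https://en.wikipedia.org/wiki/Red%25E2%2580%2593black_tree")
intended = looks_org_encoded(
    "https://en.wikipedia.org/wiki/Red%E2%80%93black_tree")
```

With these inputs, the first URI would trigger the warning and the second
would be left alone, since %E2, %80 and %93 cannot come from Org-encoding.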

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Jens Lechtenboerger
Hi there,

I like this proposal.

On 2019-03-01, Nicolas Goaziou wrote:

> 3. There will be some backward compatibility issues. We can add
>    a checker in Org Lint to catch most of those. For example, we could
>    look at URIs where every percent is followed only by 25, 5B, and 5D.

I do not understand this point.  What is special about URIs where
*only* those occur?  Might compatibility issues not arise if those
occur at all (while others such as %28 and %29 for parentheses might
occur without problems as well)?

Best wishes
Jens



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Michael Brand
On Fri, Mar 1, 2019 at 9:15 AM Nicolas Goaziou  wrote:

> Thinking a bit more about it, we don't need to escape /all/ square
> brackets, only "]]" and "][" constructs. Therefore, we don't need to
> escape every backslash either.

Brilliant!



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Nicolas Goaziou
Nicolas Goaziou  writes:

> The regexp for bracket links could be, in its simple (!) form:
>
>   \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

Small update, in its string form now:

  
"\\[\\[\\([^\000]*?[^\\]\\(\\)*\\)\\]\\(?:\\[\\([^\000]+?\\)\\]\\)?\\]"



Re: [O] [RFC] Fixing link encoding once and for all

2019-03-01 Thread Nicolas Goaziou
Hello,

Neil Jerram  writes:

> Do you mean Windows file names in existing Org files?  I.e. the
> back-compatibility concern?
>
> If so, yes, I confess I didn't think at all about back-compatibility,
> with my suggestion above.  So perhaps that rules my idea out.
>
> If we were starting from scratch, however,
> - I believe it would technically be fine; i.e. it's a complete and
> unambiguous encoding
> - it might be considered awkward for Windows users to have to write
> c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
> know how big a concern that would be.

Thinking a bit more about it, we don't need to escape /all/ square
brackets, only "]]" and "][" constructs. Therefore, we don't need to
escape every backslash either.

The regexp for bracket links could be, in its simple (!) form:

  \[\[\(.*?[^\\]\(?:\\\)*\)\]\(?:\[\([^\000]+?\)\]\)?\]

Most links would need no change.  I see one notable exception:
directories in Windows:

  [[c:\system32\\]] for "c:\system32\"

Some further notes:

1. Macros already use backslashes to escape commas in arguments, so it
   is at least consistent with this part of Org.
   
2. The description part of the link, like most parts of Org, does not
   use backslash escaping. If needed, we can implement an entity for
   a square bracket.

3. There will be some backward compatibility issues. We can add
   a checker in Org Lint to catch most of those. For example, we could
   look at URIs where every percent is followed only by 25, 5B, and 5D.

WDYT?

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-28 Thread Nicolas Goaziou
Hello,

Jens Lechtenboerger  writes:

> I copied that from the address bar of my browser, probably two years
> ago.  Today, I was surprised by a compilation failure.

Link syntax is currently unstable. We fix it on one side and it breaks
elsewhere. 

This thread is an attempt to make the link syntax stable. It will not
necessarily solve your example, tho.

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-28 Thread Neil Jerram
On Wed, 27 Feb 2019 at 10:49, Nicolas Goaziou  wrote:
>
> Hello,
>
> Neil Jerram  writes:
>
> > I'm not sure how much freedom you have here, but I think it would be
> > both clearer - by avoiding confusion with URL-escaping - and easier to
> > type, to use an entirely different form of escaping in the Org syntax;
> > probably just this:
> >
> > \[ and \] to include a square bracket in a link
> > \\ to include a backslash
>
> Wouldn't that become problematic with file names in Windows?

Do you mean Windows file names in existing Org files?  I.e. the
back-compatibility concern?

If so, yes, I confess I didn't think at all about back-compatibility,
with my suggestion above.  So perhaps that rules my idea out.

If we were starting from scratch, however,
- I believe it would technically be fine; i.e. it's a complete and
unambiguous encoding
- it might be considered awkward for Windows users to have to write
c:\\system32\\mydoc.txt instead of c:\system32\mydoc.txt, but I don't
know how big a concern that would be.

Best wishes,
 Neil


> Regards,
>
> --
> Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-27 Thread Jens Lechtenboerger
On 2019-02-27, Nicolas Goaziou wrote:

> Hello,
>
> Jens Lechtenboerger  writes:
>
>> When exporting the following link to LaTeX, the decoding fails.
>>
>> --8<---cut here---start->8---
>> [[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
>> --8<---cut here---end--->8---
>
> According to my suggestion in this thread, this link should be written
>
>   [[https://en.wikipedia.org/wiki/Red%25E2%2580%2593black_tree][Red-black trees]]
>
> i.e., either you wrote it by hand, or `org-insert-link' failed.

I copied that from the address bar of my browser, probably two years
ago.  Today, I was surprised by a compilation failure.

> With the \-escape solution suggested by Neil, it would be correctly
> processed without additional change. Of course, that would entail other
> difficulties.

You mentioned Windows file names.  I’m not affected by that.  URLs
in my Org files neither contain “[” nor “\” (but lots of “%”).  So
the suggestion by Neil would be fine for me.

Best wishes
Jens



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-27 Thread Nicolas Goaziou
Hello,

Jens Lechtenboerger  writes:

> When exporting the following link to LaTeX, the decoding fails.
>
> --8<---cut here---start->8---
> [[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
> --8<---cut here---end--->8---

According to my suggestion in this thread, this link should be written

  [[https://en.wikipedia.org/wiki/Red%25E2%2580%2593black_tree][Red-black trees]]

i.e., either you wrote it by hand, or `org-insert-link' failed.

With the \-escape solution suggested by Neil, it would be correctly
processed without additional change. Of course, that would entail other
difficulties.

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-27 Thread Nicolas Goaziou
Hello,

Neil Jerram  writes:

> I'm not sure how much freedom you have here, but I think it would be
> both clearer - by avoiding confusion with URL-escaping - and easier to
> type, to use an entirely different form of escaping in the Org syntax;
> probably just this:
>
> \[ and \] to include a square bracket in a link
> \\ to include a backslash

Wouldn't that become problematic with file names in Windows?

Regards,

-- 
Nicolas Goaziou



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-27 Thread Jens Lechtenboerger
On 2019-02-24, Nicolas Goaziou wrote:

> Recently[1], issues about link escaping have resurfaced. I'd like to
> solve this once and for all.

Good morning,

I updated to Org mode version 9.2.1 (9.2.1-33-g029cf6-elpa @
/home/user/.emacs.d/elpa/org-20190225/).

When exporting the following link to LaTeX, the decoding fails.

--8<---cut here---start->8---
[[https://en.wikipedia.org/wiki/Red%E2%80%93black_tree][Red-black trees]]
--8<---cut here---end--->8---

The output is this:
--8<---cut here---start->8---
\href{https://en.wikipedia.org/wiki/Red\â\€\“black\_tree}{Red-black trees}
--8<---cut here---end--->8---

Previously, I got:
--8<---cut here---start->8---
\href{https://en.wikipedia.org/wiki/Red\%E2\%80\%93black\_tree}{Red-black trees}
--8<---cut here---end--->8---

Best wishes
Jens



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-25 Thread stardiviner


Nicolas Goaziou  writes:

> Hello,
>
> Recently[1], issues about link escaping have resurfaced. I'd like to
> solve this once and for all.
>
> As a reminder, the initial issue is that bracket links, i.e., "[[path]]"
> or "[[path][description]]", cannot contain square brackets, for obvious
> reasons. Therefore, they need to be escaped somehow. For some historical
> reason, the "somehow" settled, for the path part[2], on URL encoding.
> Therefore [ and ] in a link must appear as, respectively, "%5B" and
> "%5D". Of course, the initial link could already contain any of these
> strings, so percent signs also need to be escaped, as "%25". Finally,
> consecutive spaces are not handled very gracefully by the
> `fill-paragraph' function, so it is also useful, but not mandatory, to
> be able to escape white spaces, with "%20". Sadly, it can be confusing
> when Org encoding is applied on top of an already encoded URI.
>
> To sum it up, `org-link-escape', by default, URL encodes only square
> brackets, percent signs and white spaces. Note that, however,
> `org-link-unescape' is not its reciprocal function, despite its
> docstring. It URL decodes every percent encoded combination.
>
> Anyway, square brackets in a bracket link almost look like a solved
> problem. Alas, while some links are inserted by helper functions, such as
> `org-insert-link', others could have been typed right into the buffer.
> Therefore, there is usually no way to know if a link is already
> Org-encoded or not. Consequently, there is usually no way to know when
> a link needs to be Org-decoded. This is the root of all evil, or at
> least, of all bugs encountered so far. Some links end up being encoded or
> decoded one time too many.
>
> To solve this, we must assume that every bracket link is properly
> Org-encoded in a buffer. In other words, when typing, or yanking,
> a bracket link right into a buffer, users are required to use %5B, %5D,
> and %25 in the path part of the link, if necessary. I understand it will
> bite some users, but using `org-insert-link' would mitigate the pain. It
> is also limited to square brackets, which, I assume, is not the type of
> link you usually yank.
>
> With that assumption, the parser can safely Org-decode links
> appropriately, and store paths in their decoded form. Consumers, like
> export back-ends, need not call `org-link-unescape' anymore. In fact,
> the only situation where `org-link-unescape' is still needed is when
> extracting the path part of a bracket link from the buffer, e.g.,
> through regexp matching.
>
> Of course, the manual should mention this assumption, if we agree on it.
>
> Thoughts?
>
> Regards,
>

I agree and upvote this. Using `org-insert-link' as the single entry point
will help unify all behavior. The only inconvenience arises where the user
cannot access `org-insert-link' and has to type the link literally, e.g.,
on the web or in another editor. But Org mode is largely limited to Emacs
already, so I don't think adding this requirement matters much. Also, in
the end, if other clients want to support Org mode, they can insert encoded
links and handle this properly.

WDYT?

> Footnotes: 
>
> [1] E.g., 
> or .
>
> [2] There is no clear mechanism for the description part.
> `org-insert-link' will replace square brackets with curly ones. We could
> also use entities, but none of them appears as a square bracket. Anyway,
> I'll ignore this issue for the time being.


-- 
[ stardiviner ]
   I try to make every word tell the meaning what I want to express.

   Blog: https://stardiviner.github.io/
   IRC(freenode): stardiviner, Matrix: stardiviner
   GPG: F09F650D7D674819892591401B5DF1C95AE89AC3
  



Re: [O] [RFC] Fixing link encoding once and for all

2019-02-24 Thread Neil Jerram
I'm not sure how much freedom you have here, but I think it would be
both clearer - by avoiding confusion with URL-escaping - and easier to
type, to use an entirely different form of escaping in the Org syntax;
probably just this:

\[ and \] to include a square bracket in a link
\\ to include a backslash
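
A minimal Python sketch of this scheme, for illustration only: the function
names are hypothetical, and it assumes that exactly these three characters
(backslash and both square brackets) are escaped and nothing else.

```python
import re

def backslash_escape(path: str) -> str:
    """Prefix every backslash and square bracket with a backslash."""
    return re.sub(r"([\\\[\]])", r"\\\1", path)

def backslash_unescape(path: str) -> str:
    """Drop the escaping backslash in front of a backslash or bracket."""
    return re.sub(r"\\([\\\[\]])", r"\1", path)

windows_path = "c:\\system32\\mydoc.txt"      # i.e. c:\system32\mydoc.txt
escaped_path = backslash_escape(windows_path)  # every backslash is doubled
url = "https://en.wikipedia.org/wiki/Red%E2%80%93black_tree"
escaped_url = backslash_escape(url)            # percent signs are untouched
```

Percent-encoded URLs pass through unchanged, so there is no interaction with
URL-escaping. Windows paths, on the other hand, need every backslash doubled.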

Regards,
Neil

On Sun, 24 Feb 2019 at 01:18, Nicolas Goaziou  wrote:
>
> Hello,
>
> Recently[1], issues about link escaping have resurfaced. I'd like to
> solve this once and for all.
>
> As a reminder, the initial issue is that bracket links, i.e., "[[path]]"
> or "[[path][description]]", cannot contain square brackets, for obvious
> reasons. Therefore, they need to be escaped somehow. For some historical
> reason, the "somehow" settled, for the path part[2], on URL encoding.
> Therefore [ and ] in a link must appear as, respectively, "%5B" and
> "%5D". Of course, the initial link could already contain any of these
> strings, so percent signs also need to be escaped, as "%25". Finally,
> consecutive spaces are not handled very gracefully by the
> `fill-paragraph' function, so it is also useful, but not mandatory, to
> be able to escape white spaces, with "%20". Sadly, it can be confusing
> when Org encoding is applied on top of an already encoded URI.
>
> To sum it up, `org-link-escape', by default, URL encodes only square
> brackets, percent signs and white spaces. Note that, however,
> `org-link-unescape' is not its reciprocal function, despite its
> docstring. It URL decodes every percent encoded combination.
>
> Anyway, square brackets in a bracket link almost look like a solved
> problem. Alas, while some links are inserted by helper functions, such as
> `org-insert-link', others could have been typed right into the buffer.
> Therefore, there is usually no way to know if a link is already
> Org-encoded or not. Consequently, there is usually no way to know when
> a link needs to be Org-decoded. This is the root of all evil, or at
> least, of all bugs encountered so far. Some links end up being encoded or
> decoded one time too many.
>
> To solve this, we must assume that every bracket link is properly
> Org-encoded in a buffer. In other words, when typing, or yanking,
> a bracket link right into a buffer, users are required to use %5B, %5D,
> and %25 in the path part of the link, if necessary. I understand it will
> bite some users, but using `org-insert-link' would mitigate the pain. It
> is also limited to square brackets, which, I assume, is not the type of
> link you usually yank.
>
> With that assumption, the parser can safely Org-decode links
> appropriately, and store paths in their decoded form. Consumers, like
> export back-ends, need not call `org-link-unescape' anymore. In fact,
> the only situation where `org-link-unescape' is still needed is when
> extracting the path part of a bracket link from the buffer, e.g.,
> through regexp matching.
>
> Of course, the manual should mention this assumption, if we agree on it.
>
> Thoughts?
>
> Regards,
>
> Footnotes:
>
> [1] E.g., 
> or .
>
> [2] There is no clear mechanism for the description part.
> `org-insert-link' will replace square brackets with curly ones. We could
> also use entities, but none of them appears as a square bracket. Anyway,
> I'll ignore this issue for the time being.
>
> --
> Nicolas Goaziou
>



[O] [RFC] Fixing link encoding once and for all

2019-02-23 Thread Nicolas Goaziou
Hello,

Recently[1], issues about link escaping have resurfaced. I'd like to
solve this once and for all.

As a reminder, the initial issue is that bracket links, i.e., "[[path]]"
or "[[path][description]]", cannot contain square brackets, for obvious
reasons. Therefore, they need to be escaped somehow. For some historical
reason, the "somehow" settled, for the path part[2], on URL encoding.
Therefore [ and ] in a link must appear as, respectively, "%5B" and
"%5D". Of course, the initial link could already contain any of these
strings, so percent signs also need to be escaped, as "%25". Finally,
consecutive spaces are not handled very gracefully by the
`fill-paragraph' function, so it is also useful, but not mandatory, to
be able to escape white spaces, with "%20". Sadly, it can be confusing
when Org encoding is applied on top of an already encoded URI.

To sum it up, `org-link-escape', by default, URL encodes only square
brackets, percent signs and white spaces. Note that, however,
`org-link-unescape' is not its reciprocal function, despite its
docstring. It URL decodes every percent encoded combination.
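
To make the asymmetry concrete, here is a minimal Python sketch. It is
a model for illustration, not Org's Emacs Lisp code: org_escape and
org_unescape are hypothetical stand-ins for `org-link-escape' and
`org-link-unescape', and their behavior is assumed from the description
above (the encoder touches four characters, the decoder decodes every
percent sequence).

```python
from urllib.parse import unquote

# The only characters the encoder is assumed to touch by default.
ORG_ESCAPES = {"%": "%25", "[": "%5B", "]": "%5D", " ": "%20"}

def org_escape(path: str) -> str:
    """Percent-encode square brackets, percent signs and spaces only."""
    return "".join(ORG_ESCAPES.get(ch, ch) for ch in path)

def org_unescape(path: str) -> str:
    """Decode every percent-encoded combination, whatever its origin."""
    return unquote(path)

raw = "https://en.wikipedia.org/wiki/Red%E2%80%93black_tree"
encoded = org_escape(raw)           # original escapes are protected by %25
decoded_once = org_unescape(encoded)
over_decoded = org_unescape(raw)    # decoding a link that was never encoded
```

Decoding what the encoder produced gives back the original URI, but decoding
a URI that was never Org-encoded alters it, and re-encoding the result does
not restore it.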

Anyway, square brackets in a bracket link almost look like a solved
problem. Alas, while some links are inserted by helper functions, such as
`org-insert-link', others could have been typed right into the buffer.
Therefore, there is usually no way to know if a link is already
Org-encoded or not. Consequently, there is usually no way to know when
a link needs to be Org-decoded. This is the root of all evil, or at
least, of all bugs encountered so far. Some links end up being encoded or
decoded one time too many.

To solve this, we must assume that every bracket link is properly
Org-encoded in a buffer. In other words, when typing, or yanking,
a bracket link right into a buffer, users are required to use %5B, %5D,
and %25 in the path part of the link, if necessary. I understand it will
bite some users, but using `org-insert-link' would mitigate the pain. It
is also limited to square brackets, which, I assume, is not the type of
link you usually yank.

With that assumption, the parser can safely Org-decode links
appropriately, and store paths in their decoded form. Consumers, like
export back-ends, need not call `org-link-unescape' anymore. In fact,
the only situation where `org-link-unescape' is still needed is when
extracting the path part of a bracket link from the buffer, e.g.,
through regexp matching.
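
The "decode once, in the parser" idea can be sketched in Python as follows.
This is an illustration under assumptions, not the Org parser: the names are
hypothetical, the link matcher is deliberately simplified (no brackets
inside the path or description), and the decoder is assumed to undo only
the four default escapes.

```python
import re

# Assumed strict inverse of the default escaping.
ORG_DECODE = {"%5B": "[", "%5D": "]", "%25": "%", "%20": " "}
_ESCAPE_RE = re.compile(r"%(?:5B|5D|25|20)", re.IGNORECASE)
# Simplified bracket link: [[path]] or [[path][description]].
_LINK_RE = re.compile(r"\[\[([^\[\]]+)\](?:\[([^\[\]]+)\])?\]")

def org_decode(path: str) -> str:
    """Undo the four default escapes in a single left-to-right pass."""
    return _ESCAPE_RE.sub(lambda m: ORG_DECODE[m.group(0).upper()], path)

def parse_links(text: str):
    """Return (decoded_path, description) pairs; paths are decoded here,
    once, so consumers never need to decode again."""
    return [(org_decode(m.group(1)), m.group(2))
            for m in _LINK_RE.finditer(text)]

links = parse_links("See [[file:a%5B1%5D.org][notes]] and [[file:50%25.txt]]")
```

Because the substitution is a single pass, a sequence such as %2580 becomes
%80 and is not decoded a second time.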

Of course, the manual should mention this assumption, if we agree on it.

Thoughts?

Regards,

Footnotes: 

[1] E.g., 
or .

[2] There is no clear mechanism for the description part.
`org-insert-link' will replace square brackets with curly ones. We could
also use entities, but none of them appears as a square bracket. Anyway,
I'll ignore this issue for the time being.

-- 
Nicolas Goaziou