> Date: Tue, 5 Feb 2019 00:05:47 +0000
> From: Richard Wordingham via Unicode <unicode@unicode.org>
> 
> > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > by paragraph separator characters. This means characters whose bidi
> > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> 
> It actually gives two different definitions. Table 4 of UAX#9 restricts
> the type B to *appropriate* newline functions; not all newlines are
> paragraph separators.

For what exactly an "appropriate newline function" is, one should read
the Unicode Standard, section 5.8.  My conclusions from that are
different from yours; see below.

> > Indeed, this was an oversight on my side. So, with this definition,
> > every single newline character starts a new paragraph. The result of
> > printf "Hello\nWorld\n" > world.txt
> > is a text file consisting of two paragraphs, with 5 characters in
> > each. Correct?
> 
> No, it depends on when a newline function is 'appropriate'. TUS 5.8
> Rule R2b applies - 'In simple text editors, interpret any NLF the same
> as LS'.

That's not all of what the Standard says.  Just a couple of paragraphs
above Rule R2b, there's this text:

  Note that even if an implementer knows which characters represent
  NLF on a particular platform, CR, LF, CRLF, and NEL should be
  treated the same on input and in interpretation. Only on output is
  it necessary to distinguish between them.

So in practice, IMO the above example does constitute 2 paragraphs,
regardless of the underlying platform's conventions.
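
To make that reading concrete, here is a rough Python sketch of my own
(not anything taken from UAX#9 or TUS).  It assumes that every character
whose bidi class is B ends a paragraph, that CR LF counts as a single
separator, and that no higher-level protocol overrides this.  The
function name split_paragraphs is made up for illustration, and the
classes reported by unicodedata depend on the Unicode version bundled
with the Python interpreter.

```python
import unicodedata


def split_paragraphs(text):
    """Split text at characters whose bidi class is B.

    Assumption: CR LF is treated as one separator, and the separators
    themselves are dropped from the returned paragraphs.
    """
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            # A CR immediately followed by LF counts as a single separator.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    # Text after the last separator forms a final, unterminated paragraph.
    if current:
        paragraphs.append("".join(current))
    return paragraphs


if __name__ == "__main__":
    # The contents of world.txt from the printf example.
    print(split_paragraphs("Hello\nWorld\n"))
```

Under those assumptions the LF, CR LF, and NEL variants of the file all
split the same way, whereas U+2028 LINE SEPARATOR (bidi class WS) does
not start a new paragraph.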
