Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-18 Thread Dave Cridland
On Wed, 9 Dec 2020 at 19:21, Sam Whited wrote:

> I believe this is a mischaracterization of my argument. My argument is
> "everything will have a way to get at the underlying bytes, not
> everything will have them pre-converted into code points".


I think this, in particular, is not correct.

The counter-argument - that everything can obtain a sequence of codepoints,
but might not be able to get at a sequence of octets - is more accurate.

In particular, I think anything based on Python would only receive text
nodes as `str` objects, which are codepoint-based, and the {de|en}coding to
UTF-8 is part and parcel of the XML [de]serialization.

If we're counting codepoints and we only have the UTF-8, though, this
should be fairly easy without formal decoding, assuming we do not require
normalization.
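
A minimal sketch in Python of counting code points straight from the UTF-8 octets, assuming the input is well-formed UTF-8 and that no normalization is required. The function name and the sample text are made up for illustration:

```python
def count_code_points(utf8: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it."""
    # Continuation bytes have the form 0b10xxxxxx; every other byte
    # starts a new code point.
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)


body = "na\u00efve \U0001F9A0"   # "naïve " followed by the microbe emoji
wire = body.encode("utf-8")

print(len(body))                 # 7: code points, as Python's str sees them
print(len(wire))                 # 11: octets on the wire
print(count_code_points(wire))   # 7: code points, computed from the octets
```

The count is a single pass over the bytes, with no intermediate string.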


> Also "this
> gives us the option to do certain optimizations on systems that support
> them, but using code points doesn't so we should do the thing that is
> the most flexible".
>

Oh, I agree with this, as a broad principle. But I don't think it's viable
in this case.


>
> —Sam
>
> On Wed, Dec 9, 2020, at 19:09, Tedd Sterr wrote:
> > Regardless, your argument is still "bytes is more convenient for me,
> > so everyone else should do what's best for me." I don't think that's a
> > good argument.
___
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
___


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
XML is a sequence of characters (not bytes.)

References mark a portion of displayed text which is rendered as a sequence of 
characters (not bytes.)

So it makes perfect sense to define references in terms of bytes.



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
I believe this is a mischaracterization of my argument. My argument is
"everything will have a way to get at the underlying bytes, not
everything will have them pre-converted into code points". Also "this
gives us the option to do certain optimizations on systems that support
them, but using code points doesn't so we should do the thing that is
the most flexible".

—Sam

On Wed, Dec 9, 2020, at 19:09, Tedd Sterr wrote:
> Regardless, your argument is still "bytes is more convenient for me,
> so everyone else should do what's best for me." I don't think that's a
> good argument.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
>> The decoding _should_ be done upfront - that's how you get a valid XML 
>> document.

> I don't think this is true. XML is defined as UTF-8 (in this case),
> which is a collection of bytes. They don't have to be separated out and
> transformed into some higher representation of code points. Just because
> Python et al. convert things into UTF-32 strings first doesn't mean
> everything has to.
>
> Regardless of what language you're using it's trivial to deal with this
> as a UTF-8 byte stream, it is not always trivial to handle this as a
> UTF-32 integer stream as the example shows.

XML is defined as a sequence of characters; it doesn't specify how those 
characters must be encoded (though it does require support for both UTF-8 and 
UTF-16.) UTF-7/8/16/32 are encoding schemes, not character representations - 
people do make the mistake of conflating the two things, but that doesn't mean 
they are the same.

Unicode doesn't specify the size of characters - they don't have a specific 
bit-width, they are as large as required; the encoding scheme is then a method 
to transform characters into a sequence of bytes. It shouldn't matter what 
encoding scheme is used - UTF-8, UTF-16, ISO-8859-9, ISO-2022-JP, Shift_JIS, 
EBCDIC, are all possibilities - because you're supposed to decode the data into 
characters before doing anything with it.

The fact that you're able to take advantage of the foreknowledge of your data 
being encoded using UTF-8 is purely because XMPP happens to define it that way, 
not because XML is defined using any specific encoding scheme. Basing your 
entire implementation around the expectation of UTF-8 allows you to take some 
convenient short-cuts, but much of that only works because XML markup uses 
ASCII-compatible characters, which conveniently have an equivalent single-byte 
representation when encoded as UTF-8; if it were almost any other encoding then 
it simply wouldn't work without some form of decoding first. If you insist on 
not decoding and then run into difficulties with handling characters because 
you're purposely avoiding handling characters while simultaneously using XML 
which is defined as a sequence of characters, an appropriate response is "what 
did you expect?"
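
A small Python sketch of why scanning raw bytes for markup only works with an ASCII-compatible encoding such as UTF-8. The sample document is invented for illustration:

```python
text = "<a>\u3c00</a>"   # U+3C00 is an ordinary CJK ideograph

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# In UTF-8 every byte below 0x80 is an ASCII character, so a byte scan
# for "<" finds exactly the two markup delimiters.
print(utf8.count(b"<"))    # 2

# In UTF-16 the ideograph U+3C00 is encoded as the bytes 00 3C, and 0x3C
# is also "<", so the same naive byte scan reports a third delimiter.
print(utf16.count(b"<"))   # 3
```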

It's not trivial to handle everything as UTF-8 in implementations where the 
application receives already decoded strings (a sequence of characters, not 
bytes) from the XML parser. The most likely approach to dealing with that will 
be to re-encode the already decoded data back into UTF-8 just to deal with the 
offsets, which is precisely the kind of inefficient processing you're 
suggesting should be avoided. And considering the whole purpose of references 
is for marking sequences of characters, those characters are going to be 
decoded at some point; you're trying to avoid decoding early, while still 
validating offsets, so that the decoding can be done later anyway.
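
A sketch of the re-encoding step described above, for an application that only receives decoded strings from its XML parser. Both helper names are hypothetical:

```python
def byte_offset_to_code_point_index(text: str, byte_offset: int) -> int:
    """Map a UTF-8 byte offset onto an index into an already decoded str."""
    wire = text.encode("utf-8")                   # re-encode the decoded text
    prefix = wire[:byte_offset].decode("utf-8")   # raises if a character is split
    return len(prefix)


def code_point_index_to_byte_offset(text: str, index: int) -> int:
    """Map a code point index onto the matching UTF-8 byte offset."""
    return len(text[:index].encode("utf-8"))


text = "h\u00e9llo w\u00f6rld"
print(byte_offset_to_code_point_index(text, 7))   # 6: the "w"
print(code_point_index_to_byte_offset(text, 6))   # 7
```

Every lookup pays for an encode and a decode of the prefix, which is the extra work the paragraph above refers to.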

Regardless, your argument is still "bytes is more convenient for me, so 
everyone else should do what's best for me." I don't think that's a good 
argument.



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
I don't think this is true. XML is defined as UTF-8 (in this case),
which is a collection of bytes. They don't have to be separated out and
transformed into some higher representation of code points. Just because
Python et al. convert things into UTF-32 strings first doesn't mean
everything has to.

Regardless of what language you're using it's trivial to deal with this
as a UTF-8 byte stream, it is not always trivial to handle this as a
UTF-32 integer stream as the example shows.

—Sam

On Wed, Dec 9, 2020, at 14:03, Tedd Sterr wrote:
> The decoding _should_ be done upfront - that's how you get a valid XML
> document.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Jonas Schäfer
For the record:

On Tuesday, 8 December 2020 23:13:08 CET Sam Whited wrote:
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair and
> removes one of my arguments, I did not know that was the case. However,
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation that many XML
> libraries may not even use.

Let me dig up the references:

https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A parsed entity contains text, a sequence of characters, which
> may represent markup or character data.]

text = sequence of characters, representing markup or character data

https://www.w3.org/TR/REC-xml/#syntax

> [Definition: All text that is not markup constitutes the character data of
> the document.]

Ok, so we have text which is a sequence of characters, and what isn’t markup 
is character data.

Now what are characters in XML? Back to: 
https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A character is an atomic unit of text as specified by ISO/IEC
> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
> these standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these standards by
> amendments or new editions. Consequently, XML processors MUST accept any
> character in the range specified for Char.]

That is the definition of a subset of the Unicode code point range:

> [2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>                    | [#x10000-#x10FFFF]
>       /* any Unicode character, excluding the surrogate blocks, FFFE,
>          and FFFF. */
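
The Char production of XML 1.0 can be written as a predicate over code points. A minimal Python sketch; the function name is made up for illustration:

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF          # up to, but excluding, the surrogates
        or 0xE000 <= cp <= 0xFFFD        # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF     # supplementary planes
    )


print(is_xml_char(ord("a")))   # True
print(is_xml_char(0xD800))     # False: surrogate
print(is_xml_char(0xFFFE))     # False
```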

kind regards,
Jonas




Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
Sam, your argument appears to be "I want to handle everything as bytes without 
doing any string decoding, so any other option would be more effort (less 
efficient) for me."

XML is defined as a sequence of characters, not bytes - those characters 
subsequently need to be transformed into bytes for the purpose of 
storage/transmission, and that's defined by the encoding scheme (UTF-8 in this 
case.) Bytes is convenient for you, but not for everyone else using a language 
that does the decoding upfront. The decoding _should_ be done upfront - that's 
how you get a valid XML document.

If you're trying to handle XML without first decoding from UTF-8 so you can 
save a few clock-cycles, that's cool, but you are going to run into awkward 
annoyances when it comes to trying to handle such alien concepts as characters. 
The reason you can mostly get away with not decoding is because the lower half 
of ASCII is represented the same way when using UTF-8, so you can pretend the 
XML tags are encoded as ASCII characters and just treat any Unicode strings as 
opaque binary blobs - but that is only a convenient hack. If everyone else is 
to go along with your convenient hack, that only means they will have to deal 
with their own awkward annoyances because they made the terrible decision to 
decode strings before handling them (as if that's what you're actually supposed 
to do.)



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
To try and show why I'm pushing back on this so hard here is an example
of doing this three different ways: one assuming the references are
bytes, two assuming the references are code points.

https://play.golang.org/p/kKbr2hXd56U

The third one I had forgotten I could do, and it looks quite nice (if we
ignore the performance cost, as people seem to want to do), but we can't
do any error handling, for reasons explained in the comments. If we're a
client this may not matter: it's not the end of the world if we show the
user a reference that starts or ends with an ugly error character box or
something. If we're the server this might matter more. Either way, I
think having a sane way to do error handling on bad references is a
requirement.

Of course, this is Go specific, but the solutions probably look similar
in other C-like languages. I should also note that this is using a
higher-level decoding API than I am using, but it doesn't matter, since
the extra boilerplate required to do this at the lower level, where you
get byte slices out, would look the same for the first two examples.
However, it would require extra work for me to do the third example
(because it would give me []byte, not a string), which makes it even less
practical, and the third example isn't a convenience that exists in e.g.
C, so generally it's worth just ignoring.

If I'm having to pick between the code in the first and second example,
please let me pick the first.

—Sam

On Tue, Dec 8, 2020, at 22:13, Sam Whited wrote:
> The XML library I use does not give me a string or slice of code
> points, it gives me a slice of bytes because that's the level I'm
> operating at. Even at the higher level if I decode the bytes into a
> string (A Go string in this case), that is still just a slice of UTF-8
> bytes (it does not decode them, ensure they're valid, and turn them
> into a slice of code points, that is a very expensive operation that
> it avoids until you need it or explicitly do it yourself).
>
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair
> and removes one of my arguments, I did not know that was the case.
> However, I still think the data on the wire should describe the other
> data on the wire, not some higher-level "decoded" representation that
> many XML libraries may not even use.
>
> —Sam
>
> On Tue, Dec 8, 2020, at 21:32, Jonas Schäfer wrote:
> > But all implementations which want to be XMPP and XML 1.0 compliant
> > need to have some way to convert or offer access to code points, as
> > that’s the XML data model. Let’s build on that.
> >
> > Easy choice.
> >
> > Much easier than writing 20 emails on this topic, and that just in
> > this thread.

-- 
Sam Whited


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Marvin W
Hi,

On 09.12.20 08:59, Florian Schmaus wrote:
> But the recipient would be able to apply the same rules regarding
> localization as the sender when counting grapheme clusters.

Which rules? Unicode does not provide a locale specific grapheme
clustering algorithm, TR29 only mentions that those exist and that it
only provides a "default" algorithm that can be extended upon with
locale specific rules. AFAIK there is no standard that properly defines
grapheme clustering other than the TR29 algorithm, which specifically
states that it does not create proper locale-specific grapheme clusters. The
only thing we can do is say "do what TR29 says" (it actually gives two
options, but let's just stick with extended grapheme clusters). However,
TR29 itself does not make any statements regarding its stability and
Unicode updates in the last years did change TR29 behavior even for
existing codepoints. Thus if we rely on TR29 algorithm we need to
specify a version of it, which in general is a bad idea.

> I also suggest that the receiving side is considered. For example:
> "Entities that receive character counted text should normalize the
> counted text to Unicode Normalization Form C (NFC) [1] prior to
> evaluating the character indexes."

As I mentioned earlier, normalizing is changing the codepoints and thus
(in XML layer) changing the transferred content. In my tests, I haven't
seen any current server implementation doing that. Worst case,
normalizing can result in messages getting unreadable to the receiving
client that otherwise would have been readable (if the server has a
newer Unicode version than both clients' fonts). So instead of adding
client side behavior to handle servers doing modifications, I'd rather
codify that servers SHOULD NOT modify the codepoints in <body/>. Where we
put this rule is another question.

In my draft I specifically had the rule that if an entity applies
normalization they have to update the indices if needed. This also
applies to receiving entities which is incompatible with what you wrote
(or at least I understand that you want to normalize without updating
indices).

Here is the rationale behind that:
Normalization as per TR15 is considered stable, which means that as long
as you only use codepoints that are defined in the Unicode version your
code uses, any future Unicode/TR15 version will consider the string
normalized. In other terms, this means that to ensure your client only
sends normalized strings (which you would need to, so that any other
entity can apply normalization without changing indices), you'd have to
restrict your client to only send codepoints that are defined in the
Unicode version it supports.
However in practice, users have been sending codepoints that are not
part of the Unicode specification implemented in their clients. This is
because you can practically use new emojis (and their codepoints) as
soon as they appear in popular fonts.

Just to make an example: To support latest Emojis in Android apps, you
can use the "EmojiCompat" support library (that includes a font with all
emojis of the latest version) and thereby become able to display them.
However, the supported Unicode version for all text processing still
remains the version implemented by the ICU4J version shipped with the
operating system. About 60% of Android devices currently in use have
Android 9 or earlier and thus implement Unicode 10.0 or earlier (which
was released mid 2017). Thus 60% of Android devices would not be able to
correctly normalize messages that include the 🦠 microbe emoji. Thus, in
practice, sending clients cannot guarantee to send normalized strings
without severely harming user experience by not accepting new
codepoints. This also means that receiving clients cannot rely on
receiving normalized messages or messages where indices refer to
normalized messages.
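
A sketch of how a client could detect text that contains code points its own Unicode data does not know about. It assumes Python's `unicodedata` module stands in for the platform's Unicode library (ICU4J in the Android example); the function name is hypothetical:

```python
import unicodedata


def unassigned_code_points(text: str) -> list:
    """Return the characters the local Unicode database does not know."""
    # General category "Cn" means "not assigned" in this Unicode version.
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


print(unicodedata.unidata_version)          # Unicode version of this runtime
print(unassigned_code_points("ok \u0378"))  # U+0378 is currently unassigned
```

A non-empty result means this runtime cannot promise that its normalization of the text will match that of a newer Unicode version.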

Marvin


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Florian Schmaus
On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
>> We do have xml:lang, don't we?
>
> Unfortunately, it doesn't help in all cases. It's perfectly fine to write
> a message with xml:lang="en":
>
>> "chlapec" is "boy" in slowak
>
> This is 27 grapheme clusters, but I guess most western people would
> count it as 28.

But the recipient would be able to apply the same rules regarding
localization as the sender when counting grapheme clusters.

>> Let us ignore grapheme clusters for a moment and focus on XEP-0426:
>> Have you considered Unicode normalization? Especially when a text
>> that was originally in decomposed form is normalized to composed
>> form. This would corrupt the code point indexes.
>>
>> [..]
>>
>> I think that due to this, XEP-0426 should specify that counting
>> happens with the text in NFC form. Or am I missing something?
>
> I could imagine going for something like:

Yes, that definitely goes in the right direction.

>> Receiving or intermediary entities SHOULD not apply Unicode
>> normalization to the text referenced from character counting.

I am not sure that you can (or that we should) put normative text that
applies to intermediate hops into XEP-0426. The XEP could/should limit
itself to describing normative clauses for the end-points exchanging
character counting data.

>> If entities apply Unicode normalization, they SHOULD update all
>> positions, indices and lengths derived from character counting if
>> required.

As above. I think this would need at least a discoverable disco#info
feature. But even then, I doubt that this is useful in a normative form.
However, it probably cannot hurt to have XEP-0426 spell this out as a
recommendation in an informative way.

>> It is RECOMMENDED that entities creating the original
>> stanzas use NFC form.

Now that is the part I really like and which I believe to be missing
from XEP-0426. +1

I also suggest that the receiving side is considered. For example:
"Entities that receive character counted text should normalize the
counted text to Unicode Normalization Form C (NFC) [1] prior to
evaluating the character indexes."


1: https://unicode.org/reports/tr15/

- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-08 Thread Kevin Smith
On 8 Dec 2020, at 22:13, Sam Whited wrote:
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation

Agree 100%. References et al. need to calculate how the data are going to be 
encoded on the wire, not some high level abstraction. Decoding TLS is very 
expensive and I shouldn’t have to do that before I’m able to work out what the 
text being referenced is.

;)

/K





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-08 Thread Sam Whited
The XML library I use does not give me a string or slice of code points,
it gives me a slice of bytes because that's the level I'm operating at.
Even at the higher level if I decode the bytes into a string (A Go
string in this case), that is still just a slice of UTF-8 bytes (it does
not decode them, ensure they're valid, and turn them into a slice of
code points, that is a very expensive operation that it avoids until you
need it or explicitly do it yourself).

I don't understand how this is part of the XML data model. Do you mean
that only Unicode encodings are supported by XML? If so, that's fair and
removes one of my arguments, I did not know that was the case. However,
I still think the data on the wire should describe the other data on the
wire, not some higher-level "decoded" representation that many XML
libraries may not even use.

—Sam

On Tue, Dec 8, 2020, at 21:32, Jonas Schäfer wrote:
> But all implementations which want to be XMPP and XML 1.0 compliant
> need to have some way to convert or offer access to code points, as
> that’s the XML data model. Let’s build on that.
>
> Easy choice.
>
> Much easier than writing 20 emails on this topic, and that just in
> this thread.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-08 Thread Jonas Schäfer
On Friday, 4 December 2020 21:33:38 CET Sam Whited wrote:
> On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> > I begin to feel that a lot of your rationale is based on the idea that
> > you always (/often?) have access to the raw UTF-8 bytes as they
> > appeared on the wire.
> 
> Yes, most of it is.
> 
> > While this is probably true for languages where the String type's native
> > encoding is also UTF-8, it is usually not true for others. For
> > example, widely used XML parsers in Java will return Java's String
> > type, which is UTF-16 (or ISO-8859-1 [1]) based.
> 
> Yes, this is fair, I was thinking you could probably always get the raw
> bytes, but it does look like a lot of these *only* do DOM based parsing
> and don't keep the original representation.

This has nothing to do with DOM vs. whatever. SAX can also give you the data 
in the format which is described by the XML model (code points).

So it appears there are two sides and arguing from the point of view of 
programming languages will give us always those who get the raw representation 
of the data on the wire (C-ish things) and those who get the high-level 
representation.

Thus, I propose that we stick with what the standards offer. XMPP is based on 
XML in that all data exchanged is somehow wrapped in XML. XML specifies that 
all character data (text) is a sequence of unicode code points. The encoding 
on the wire is irrelevant after decoding of XML; on the *abstract* layer, XML 
provides sequences of code points, nothing else. 

Some libraries always convert to UTF-8 (libxml2), some bindings always offer 
some kind of Unicode codepoints (e.g. Python, which opportunistically chooses 
ASCII/UCS-2/UCS-4 depending on the data), some bindings may even expose the 
raw bytes and let the user deal with it (I think there was/is a zero-copy 
implementation which mostly consisted of strategically replacing XML 
metacharacters with NUL bytes in the incoming data).

But all implementations which want to be XMPP and XML 1.0 compliant need to 
have some way to convert or offer access to code points, as that’s the XML data 
model. Let’s build on that.

Easy choice.

Much easier than writing 20 emails on this topic, and that just in this 
thread.

> > However, given that there is a wide variety here, I am not sure if it
> > is worth to take any of that into consideration.
> 
> Yes, fair enough.
> 
> > Instead, my rationale is based on the idea that you always have
> > access to the Unicode code points of the textual content obtained
> > from the XML.
> 
> I do not have that access without converting from UTF-8 to code points
> in the hot-path where it would be inappropriate. It's effectively the
> same thing: I don't want to convert from bytes to code points, you don't
> want to convert from codepoints to bytes. Some languages will have to do
> the conversion either way, so it seems worth using the thing that allows
> for the most flexibility with the least amount of work in e.g. IoT
> devices using C that are trying to optimize for performance where
> passing along the bytes as received on the wire (possibly with some
> validation that the range is accurate) is acceptable.

Note that you do not have to decode UTF-8 (which can be between O(n) and 
O(n^2) depending on the implementation and circumstances) to count code 
points; you can certainly do the counting in O(n) (which is the same as 
strlen() in C). And it would be similarly easy to write algorithms to do 
efficient batched codepoint indexed operations on UTF-8 strings in C (such as 
splitting UTF-8 byte ranges based on start/end information or such), if you 
really wanted to do such things in C.
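
A sketch of such a codepoint-indexed operation on UTF-8 bytes, written in Python rather than C for brevity. It assumes well-formed UTF-8, and the function name is hypothetical:

```python
def slice_by_code_points(utf8: bytes, begin: int, end: int) -> bytes:
    """Return the bytes covering code points [begin, end) of well-formed UTF-8."""
    if not 0 <= begin <= end:
        raise ValueError("invalid range")
    start = stop = None
    seen = 0  # number of code points whose lead byte has been passed
    for offset, byte in enumerate(utf8):
        if (byte & 0xC0) != 0x80:      # lead byte: a code point starts here
            if seen == begin:
                start = offset
            if seen == end:
                stop = offset
                break
            seen += 1
    # Ranges that run up to the very end of the buffer.
    if start is None and seen == begin:
        start = len(utf8)
    if stop is None and seen == end:
        stop = len(utf8)
    if start is None or stop is None:
        raise ValueError("range exceeds text")
    return utf8[start:stop]


wire = "h\u00e9llo w\u00f6rld".encode("utf-8")
print(slice_by_code_points(wire, 6, 11))   # the bytes of "wörld"
```

This is one pass over the buffer, so O(n), and it never builds a decoded string.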

However, I also think that the IoT use-case is a bit strawmanny, given that 
IoT devices would rarely have to deal with markup or other rich human-facing 
formats which require decoding of such codepoint references.

Thus ... I don’t buy this argument. Devices which render markup or references 
would have to deal with complexity way beyond this. And they’ll have to do the 
decoding anyway to do some kind of text rendering.

kind regards,
Jonas



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-07 Thread Marvin W
Hi,

On 07.12.20 19:34, Florian Schmaus wrote:
> We do have xml:lang, don't we?

Unfortunately, it doesn't help in all cases. It's perfectly fine to write
a message with xml:lang="en":

> "chlapec" is "boy" in slowak

This is 27 grapheme clusters, but I guess most western people would
count it as 28.
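
The two counts can be reproduced with a short Python sketch. Python's standard library has no grapheme segmentation, so the regular expression below is only a stand-in for a hypothetical locale tailoring that treats "ch" as one unit:

```python
import re

message = '"chlapec" is "boy" in slowak'

# Code points: every character counts once.
print(len(message))                        # 28

# Stand-in for a tailored segmentation: the digraph "ch" is one unit.
print(len(re.findall(r"ch|.", message)))   # 27
```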

> Let us ignore grapheme clusters for a moment and focus on XEP-0426: 
> Have you considered Unicode normalization? Especially when a text 
> that was originally in decomposed form is normalized to composed 
> form. This would corrupt the code point indexes.
> 
> [..]
> 
> I think that due to this, XEP-0426 should specify that counting 
> happens with the text in NFC form. Or am I missing something?


Normalization was already discussed with respect to XEP-0426 (not sure if
that was on list or in chat). Normalizing any text as part of processing is
modifying the content (as per the XML specification). For most purposes
we assume that the server does not modify the <body/> of a message or any
other XML element which isn't clearly the server's domain to modify. If we
assume servers modify the (XML) content of <body/>, any attempt at
character counting can become worthless anyway.

I could imagine going for something like:

> Receiving or intermediary entities SHOULD not apply Unicode 
> normalization to the text referenced from character counting. If 
> entities apply Unicode normalization, they SHOULD update all 
> positions, indices and lengths derived from character counting if 
> required. It is RECOMMENDED that entities creating the original 
> stanzas use NFC form.

Marvin


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-07 Thread Florian Schmaus

Hi Marvin :)

On 12/7/20 4:22 PM, Marvin W wrote:
> On 04.12.20 21:23, Florian Schmaus wrote:
>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
>
> XEP-0426 already discusses why it's using codepoints instead of
> grapheme clusters in its rationale:
>
>> […]
>
> Also I forgot to mention that grapheme clusters are locale specific
> (example: "ch" is considered a single grapheme cluster in Slovak).

We do have xml:lang, don't we?

> Finally, I don't think that it's generally inappropriate to point inside
> a grapheme cluster (even if that's hard to implement). An example of
> where it seems appropriate to reference a part of a grapheme cluster is
> this: https://larma.de/grapheme.html


Fair point. (I am not sure about the relevance, though).

Let us ignore grapheme clusters for a moment and focus on XEP-0426: Have 
you considered Unicode normalization? Especially when a text that was 
originally in decomposed form is normalized to composed form. This would 
corrupt the code point indexes.
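
A minimal Python sketch of the problem, using the standard `unicodedata` module. The sample text and the referenced range are invented for illustration:

```python
import unicodedata

decomposed = "cafe\u0301 ok"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

begin, end = 6, 8              # code point range of "ok" in the decomposed text

print(len(decomposed), len(composed))   # 8 7
print(decomposed[begin:end])            # ok
print(composed[begin:end])              # k: the same indexes now point elsewhere
```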


XMPP does not require any Unicode normal form. Nor does XML 1.0 (as far 
as I can tell). Furthermore, XMPP does not require that the Unicode form 
is maintained.


Hence it would be perfectly possible that the Unicode normal form of 
text exchanged via XMPP changes between hops. While I am not aware of an 
implementation that does that, it is not forbidden. And when you think 
that this will never happen, then please also keep in mind that stanzas 
may be persisted in a database. For example when put in the MAM archive. 
And a database engine may perform normalization of the data.


I think that due to this, XEP-0426 should specify that counting happens 
with the text in NFC form. Or am I missing something?


- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-07 Thread Marvin W
Hi,

On 04.12.20 21:23, Florian Schmaus wrote:
> And I am in favor of code points because it allows us to aim for the
>  extended grapheme cluster algorithm, while also allowing for the 
> "simply count code points" fallback.

XEP-0426 already discusses why it's using codepoints instead of
grapheme clusters in its rationale:

> The most obvious way of counting characters is to count them how 
> humans would. This sounds easy when only having western scripts in 
> mind but becomes more complicated in other scripts and most 
> importantly is not well-defined across Unicode versions. New unicode 
> versions regularly added new possibilities to build grapheme 
> clusters, including from existing code points. To be forward 
> compatible, counting grapheme clusters, graphemes, glyphs or similar 
> is thus not an option.

Also, I forgot to mention that grapheme clusters are locale-specific
(example: "ch" is considered a single grapheme cluster in Slovak).
TR#29 even says:

> The Unicode definitions of grapheme clusters are defaults: not meant
> to exclude the use of more sophisticated definitions of tailored
> grapheme clusters where appropriate.

Finally, I don't think that it's generally inappropriate to point inside
a grapheme cluster (even if that's hard to implement). An example of
where it seems appropriate to reference a part of a grapheme cluster is
this: https://larma.de/grapheme.html

Marvin


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited


On Fri, Dec 4, 2020, at 20:53, Florian Schmaus wrote:
> If you count the bytes of the UTF-8 encoded representation, then there
> is no way to have any fallback (as the indexes would be wrong).

Maybe I don't understand the fallback you're proposing. I do understand
your example, and assert that it doesn't matter. You're not likely to
have an invalid offset and if you do then we can define a fallback for
that. It might be "the range ends at the start of the codepoint" (so you
have to decode a single codepoint, not the entire range), or it might be
"this is an invalid range, don't display anything".
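The first fallback mentioned above ("the range ends at the start of the codepoint") could look roughly like the following Python sketch. The function name is hypothetical and the code assumes the body is available as UTF-8 bytes; it is one possible reading of the idea, not something the thread or any XEP specifies.

```python
def snap_to_code_point_start(data: bytes, offset: int) -> int:
    """Move a byte offset back to the start of the code point it falls in."""
    if not 0 <= offset <= len(data):
        raise ValueError("offset outside the body")
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while 0 < offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


body = "\u00dcber".encode("utf-8")  # c3 9c 62 65 72
print(snap_to_code_point_start(body, 1))  # 0: offset 1 is inside the first character
print(snap_to_code_point_start(body, 2))  # 2: already on a boundary
```

Only the bytes around the offset are inspected; nothing before or after it has to be decoded.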

> This is, of course, because in the example the number of code points
> and graphemes is identical. But this allows developers to easily
> bootstrap this scheme by simply counting code points in the beginning.
> I wouldn't be surprised if it worked so well that they never
> even switch to grapheme counting.

We could also easily count bytes and I wouldn't be surprised if that
worked well enough and we don't have to switch to anything else.

—Sam


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 9:33 PM, Sam Whited wrote:

>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
>
> If you do bytes you could also easily convert to codepoints and then to
> grapheme clusters. It also allows for the simple "count codepoints" or
> "count bytes" fallback.


If you count the bytes of the UTF-8 encoded representation, then there 
is no way to have any fallback (as the indexes would be wrong).


Maybe an example is able to illustrate where I see the advantage of 
counting graphemes/code points over counting the bytes of the UTF-8 
encoded representation. Consider the following text:


Über

Code points: U+00DC U+0062 U+0065 U+0072
Graphemes:   (U+00DC) (U+0062) (U+0065) (U+0072)
UTF-8 bytes: c3 9c 62 65 72

Assume we want to provide the coordinates for the span that consists of 
the first two letters. e.g.:


Über
^^

Then, with a zero-indexed scheme where start is inclusive and end is 
exclusive, you either end up with


start=0
end=3

if you count bytes.

But you end up with

start=0
end=2

regardless of whether you count code points or graphemes.

This is, of course, because in the example the number of code points and 
graphemes is identical. But this allows developers to easily bootstrap 
this scheme by simply counting code points in the beginning. I wouldn't 
be surprised if it worked so well that they never even switch to 
grapheme counting.
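The difference between the two kinds of offsets can be checked with a short Python sketch. The helper name is made up for illustration; it converts a code point span into the matching byte span of the UTF-8 encoding.

```python
def code_point_span_to_byte_span(text: str, start: int, end: int) -> tuple:
    """Map a [start, end) code point span to a [start, end) UTF-8 byte span."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = byte_start + len(text[start:end].encode("utf-8"))
    return byte_start, byte_end


text = "\u00dcber"
# Hex dump of the UTF-8 encoding, one byte per group.
print(" ".join(f"{b:02x}" for b in text.encode("utf-8")))

# The first two letters: code points [0, 2) correspond to bytes [0, 3).
print(code_point_span_to_byte_span(text, 0, 2))
```

A receiver that treats a byte offset as a code point offset (or the reverse) would select a different span as soon as the text contains anything outside ASCII.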


- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited


On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> I begin to feel that a lot of your rationale is based on the idea that
> you always (/often?) have access to the raw UTF-8 bytes as they
> appeared on the wire.

Yes, most of it is.

> While this is probably true for languages where the String type's native
> encoding is also UTF-8, it is usually not true for others. For
> example, widely used XML parsers in Java return Java's String
> type, which is UTF-16 (or ISO-8859-1 [1]) based.

Yes, this is fair, I was thinking you could probably always get the raw
bytes, but it does look like a lot of these *only* do DOM based parsing
and don't keep the original representation.

> However, given that there is a wide variety here, I am not sure if it
> is worth taking any of that into consideration.

Yes, fair enough.

> Instead, my rationale is based on the idea that you always have
> access to the Unicode code points of the textual content obtained
> from the XML.

I do not have that access without converting from UTF-8 to code points
in the hot-path where it would be inappropriate. It's effectively the
same thing: I don't want to convert from bytes to code points, you don't
want to convert from codepoints to bytes. Some languages will have to do
the conversion either way, so it seems worth using the thing that allows
for the most flexibility with the least amount of work in eg. IoT
devices using C that are trying to optimize for performance where
passing along the bytes as received on the wire (possibly with some
validation that the range is accurate) is acceptable.

> And I am in favor of code points because it allows us to aim for the
> extended grapheme cluster algorithm, while also allowing for the
> "simply count code points" fallback.

If you do bytes you could also easily convert to codepoints and then to
grapheme clusters. It also allows for the simple "count codepoints" or
"count bytes" fallback.

—Sam


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 8:25 PM, Sam Whited wrote:

> On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
>> My problem with your proposal is that it uses bytes. I don't get why
>> you want to use bytes here.
>
> Naturally. Likewise my problem with your proposal is that it uses code
> points and I don't get why you'd want to use them here :)


I begin to feel that a lot of your rationale is based on the idea that 
you always (/often?) have access to the raw UTF-8 bytes as they appeared 
on the wire.


While this is probably true for languages where the String type's native 
encoding is also UTF-8, it is usually not true for others. For example, 
widely used XML parsers in Java return Java's String type, which is 
UTF-16 (or ISO-8859-1 [1]) based. Then there is Python 3, where the str 
type is a sequence of Unicode characters (code points). Of course, it 
would be possible to design and implement XML parsers in Java and Python 
that return strings as they appeared in the parsed XML document/stream.


However, given that there is a wide variety here, I am not sure if it is 
worth taking any of that into consideration.


Instead, my rationale is based on the idea that you always have access 
to the Unicode code points of the textual content obtained from the XML. 
And I am in favor of code points because it allows us to aim for the 
extended grapheme cluster algorithm, while also allowing for the "simply 
count code points" fallback.


Note that both methods, counting grapheme (clusters) vs. counting 
codepoints, would, if I did not miss a grapheme cluster, yield the same 
result for this e-mail.
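As an illustration of how the "length" of the same text depends on the representation, here is a small Python sketch. The sample string is made up; the UTF-16 figure corresponds to what Java's `String.length()` reports, since that method counts UTF-16 code units.

```python
# A sample body with one character outside the Basic Multilingual Plane.
text = "I \U0001F600 XMPP"

code_points = len(text)                           # what Python's str counts
utf16_units = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
utf8_bytes = len(text.encode("utf-8"))            # bytes on an XMPP stream

print(code_points, utf16_units, utf8_bytes)  # 8 9 11
```

All three can be derived from each other, but only by walking the text, which is the conversion cost both sides of the discussion are trying to avoid.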


- Florian


1: Please ignore this. I have only mentioned it for completeness. If you 
are curious, look up "JEP 254: Compact Strings".






Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited


On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
> Often you don't get raw bytes from your XML parser, but an instance of
> your programming language's native String type. But often your
> programming language provides an API to encode that String to UTF-8
> encoded bytes, which *should* match exactly the bytes on the wire.

That would also be expensive to do every time and I'd be willing to bet
the XML parser *also* gives you the ability to get bytes. Otherwise what
would it do with XML documents that don't use the same encoding as your
language (again, I know we always use UTF-8, but an XML parser won't
know that and may have to deal with other encodings)? Would it always
implicitly convert every single thing? That seems like it will be a
potentially very slow XML parser if it doesn't have a fallback for me to
say "just give me the raw bytes".

> My problem with your proposal is that it uses bytes. I don't get why
> you want to use bytes here.

Naturally. Likewise my problem with your proposal is that it uses code
points and I don't get why you'd want to use them here :)


>  You most certainly will obtain from your XML parser a type that can
>  be converted to a sequence of Unicode code points.

Right, which is probably UTF-8 encoded bytes. Converting them all to a
series of Unicode codepoints is more expensive. If I have
bytes to begin with I have to check if the values at the start/end of
the range are valid UTF-8 (one of the nice properties of UTF-8 is you
can know if you're at the start of a character without parsing the
whole string) instead of having to convert everything up to the end.
Then I can ignore all the bits in the middle and deal with them later
outside of the hot path if/when I convert it to a string or whatever
for display.
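The boundary check described here can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the body is already known to be valid UTF-8; it only inspects the two bytes at the ends of the range.

```python
def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """True if offset is the end of data or points at the first byte of a code point."""
    if offset == len(data):
        return True
    if not 0 <= offset < len(data):
        return False
    # Continuation bytes look like 10xxxxxx; anything else starts a code point.
    return (data[offset] & 0xC0) != 0x80


def range_is_valid(data: bytes, start: int, end: int) -> bool:
    """Check a [start, end) byte range without decoding what lies in between."""
    return (start <= end
            and is_code_point_boundary(data, start)
            and is_code_point_boundary(data, end))


body = "\u00dcber".encode("utf-8")
print(range_is_valid(body, 0, 3))  # True
print(range_is_valid(body, 1, 3))  # False, offset 1 is inside the first character
```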


> Hence I think your proposal should use code points instead. And then,
> if I am not mistaken, your proposal matches my proposal for
> opportunistic interoperability as fallback.

You may be right that it's the same as far as fallback goes. I suspect
that more things will have a UTF-8 to whatever they are conversion than
a UTF-32 to whatever they are conversion, but to be fair I have no
proof for that.

Out of curiosity, can you provide an example of an XML decoder that can
*only* give you an instance of a UTF-32 string (or whatever the
language/OS uses)? I can give plenty (the Go one for starters) where you
only get bytes out and it's up to you to figure out what to do with
them. I *could* convert those to a UTF-32 slice, but that would be
unnecessary and expensive in a language designed for performance whereas
if it's a language that's doing implicit conversion to its own thing
it's already doing implicit work and probably isn't optimizing for the
kind of fast-path performance I'd like to get.

I think I should simplify my argument to: most things use UTF-8 or at least can 
convert from UTF-8 so we should too. Using codepoints is effectively using 
UTF-32, which most things [citation needed] don't use by default.
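For reference, counting code points does not strictly require building a UTF-32 array first. The following Python sketch (hypothetical function name, assuming the input is valid UTF-8) counts code points directly on the encoded bytes by skipping continuation bytes.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


print(count_code_points("\u00dcber".encode("utf-8")))         # 4
print(count_code_points("I \U0001F600 XMPP".encode("utf-8")))  # 8
```

It is still a linear scan up to the offset in question, so it does not remove the cost Sam describes, it only avoids the intermediate allocation.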

—Sam


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 7:29 PM, Sam Whited wrote:

> I don't understand this, if you get out bytes why would they be
> different to what was in the stream?


Often you don't get raw bytes from your XML parser, but an instance of 
your programming language's native String type. But often your 
programming language provides an API to encode that String to UTF-8 
encoded bytes, which *should* match exactly the bytes on the wire.


My problem with your proposal is that it uses bytes. I don't get why you 
want to use bytes here. You most certainly will obtain from your XML 
parser a type that can be converted to a sequence of Unicode code points.


Hence I think your proposal should use code points instead. And then, if 
I am not mistaken, your proposal matches my proposal for opportunistic 
interoperability as fallback.


- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited
I don't understand this, if you get out bytes why would they be
different to what was in the stream? If you get a string in a language
that assumes strings have some specific format (ie. are valid UTF-8 or
UTF-16 or something) it makes sense that they might have had to be
different, but would anything change a raw byte slice before handing it
to you? That seems like a recipe for disaster that we can't (and
shouldn't) work around at a protocol level unless I'm seriously
misunderstanding something.

—Sam

On Fri, Dec 4, 2020, at 17:09, Kevin Smith wrote:
> Except that bytes are making significant assumptions about the
> libraries and languages being used. It’s assuming that what you get
> out of your parser corresponds to the same bytes that were on the
> stream, which seems particularly unlikely in languages that aren’t C
> at heart (C, C++, Go…).


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Kevin Smith
On 4 Dec 2020, at 16:41, Sam Whited  wrote:
> Bytes are the only way to not make assumptions about the libraries,
> languages, etc. being used.

Except that bytes are making significant assumptions about the libraries and 
languages being used. It’s assuming that what you get out of your parser 
corresponds to the same bytes that were on the stream, which seems particularly 
unlikely in languages that aren’t C at heart (C, C++, Go…).

/K


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited
On Fri, Dec 4, 2020, at 16:10, Florian Schmaus wrote:
> XMPP uses Unicode because XML, upon which XMPP is built, uses Unicode,
> hence I doubt that you will ever find an API where e.g.
> Message.getBody() will return data that is not Unicode encoded, but
> uses some other encoding scheme.

Wasn't that what you were saying? It might be UTF-16 on a JavaScript
implementation, or it might be  in a future where Unicode
is no longer the dominant way of representing characters, or in an east
asian protocol where Unicode isn't always used (again, I have no
examples of this, I've just been told that other encodings are
frequently used in China and thereabouts, but I can't verify this).

Even in languages that do use Unicode not all of them provide easy
access to codepoints. The language itself may not support UTF-8
directly, for example and always return bytes at which point the user
would have to load a UTF-8 package and parse the runes out. Or it may be
using a general purpose XML encoder that checks the XML heading to find
the character type. At the application level we know that it will always
be UTF-8, but the XML library doesn't know that so it always returns
bytes (this one is a real example that I deal with a lot).



> So, I am sorry, but I do not see your point. Furthermore, the Strings
> of all modern programming languages, I am aware of, allow you to
> derive the Unicode code points they consist of. And from those code
> points one can derive grapheme clusters.

People implement XMPP in languages that aren't modern too. That's
partially a joke, but jokes aside you're assuming the language will have
some form of encoded string. Like the XML library I gave before I can
imagine many languages won't return a string at all and will always
return bytes. Even if the language has UTF-8 encoded strings we might
want to return bytes for efficiency (bytes can be appended to an
existing buffer that gets reused, strings are immutable and therefore
require expensive allocations). In Go in particular which I write a lot
of I expect this to be the case (I would want to return bytes to give
the user the option of figuring out what to do with them).

Bytes are the only way to not make assumptions about the libraries,
languages, etc. being used.

—Sam


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 4:01 PM, Sam Whited wrote:

> On Fri, Dec 4, 2020, at 14:50, Florian Schmaus wrote:
>> But this String will be represented in your programming language's
>> native String representation, which may or may not match the bytes on
>> the wire.
>
> That's the point, we can't guarantee what the representation is. …

> it might be one of the various east Asian encodings that are still
> popular (or so I've been told).

XMPP uses Unicode because XML, upon which XMPP is built, uses Unicode, 
hence I doubt that you will ever find an API where e.g. 
Message.getBody() will return data that is not Unicode encoded, but uses 
some other encoding scheme.


So, I am sorry, but I do not see your point. Furthermore, the Strings of 
all modern programming languages, I am aware of, allow you to derive the 
Unicode code points they consist of. And from those code points one can 
derive grapheme clusters.


- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited
On Fri, Dec 4, 2020, at 14:50, Florian Schmaus wrote:
> But this String will be represented in your programming language's
> native String representation, which may or may not match the bytes on
> the wire.

That's the point, we can't guarantee what the representation is. It
might be something where codepoints makes sense, or it might be one of
the various east Asian encodings that are still popular (or so I've been
told). All of them you can probably figure out how many bytes it would
take to represent the string, but you don't necessarily want to convert
from codepoints to some non-Unicode thing or to some future
representation.

—Sam

-- 
Sam Whited


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 3:27 PM, Sam Whited wrote:

> FWIW I was a big proponent of doing it this way too, but I've changed my
> mind after seeing too many grapheme segmentation implementations be
> broken in small, different, ways. My new position is that we have to
> just count bytes and figure out a sane behavior in case someone sends us
> an invalid offset in the middle of a codepoint or something. This is
> encoding agnostic (not that it matters for XMPP) and makes it very easy
> to count: go to that byte offset, check if we're on any sort of UTF-8
> boundary, if so call it a day, if not do whatever the fallback is.


This also reads like it is mixing multiple independent layers, i.e. the 
bytes on the wire with the data you receive in the higher layers, e.g. 
your XMPP API may provide a method Message.getBody(), which returns a 
String. But this String will be represented in your programming 
language's native String representation, which may or may not match the 
bytes on the wire.


As I do not know any alternative, grapheme cluster counting is the only 
sound way for interoperability and does not exclude our friends from all 
over the world and their characters. Which is important to me.


However, I have a counter proposal that goes in a similar direction to 
yours: Even if the specification asks for grapheme clusters, there is 
nothing wrong with falling back to character counting if you haven't 
implemented grapheme cluster counting (yet). I would expect that it will 
just work most of the time (for users of the Arabic alphabet).


While this does in no way allow for sound interoperability, it is some 
sort of opportunistic interoperability.


- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Tedd Sterr

> FWIW I was a big proponent of doing it this way too, but I've changed my
> mind after seeing too many grapheme segmentation implementations be
> broken in small, different, ways. My new position is that we have to
> just count bytes and figure out a sane behavior in case someone sends us
> an invalid offset in the middle of a codepoint or something. This is
> encoding agnostic (not that it matters for XMPP) and makes it very easy
> to count: go to that byte offset, check if we're on any sort of UTF-8
> boundary, if so call it a day, if not do whatever the fallback is.

Codepoints are preferable: 
https://mail.jabber.org/pipermail/standards/2019-October/036589.html
If you're indexing by clusters then you're just asking for trouble.



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Sam Whited
FWIW I was a big proponent of doing it this way too, but I've changed my
mind after seeing too many grapheme segmentation implementations be
broken in small, different, ways. My new position is that we have to
just count bytes and figure out a sane behavior in case someone sends us
an invalid offset in the middle of a codepoint or something. This is
encoding agnostic (not that it matters for XMPP) and makes it very easy
to count: go to that byte offset, check if we're on any sort of UTF-8
boundary, if so call it a day, if not do whatever the fallback is.

—Sam

On Fri, Dec 4, 2020, at 14:15, Florian Schmaus wrote:
> Reply containing rant about how impractical grapheme cluster counting
> is in 3, 2, 1… :)


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Florian Schmaus

On 12/4/20 3:03 PM, Andrew Nenakhov wrote:

> Upping a year-old email thread for Florian.


Thanks, but I am well aware of the thread and the situation.

I think this below mixes aspects of the XML layer with the Unicode layer, 
which do not have to be mixed when counting "characters". Ultimately 
what you get out of the textual representation of the  element is 
a sequence of grapheme clusters (identified via extended grapheme 
clustering algorithm). Those are the entities that eventually should get 
counted.


Reply containing rant about how impractical grapheme cluster counting is 
in 3, 2, 1… :)


- Florian



Wed, 18 Dec 2019 at 20:41, Marvin W:


[inline]

On 12/18/19 3:22 PM, Andrew Nenakhov wrote:

In the end we have settled for counting characters of escaped string, so


This sounds like a terrible idea. In encoded XML, ">", "", ""
and "]]>" are equivalent. I just tried it out and servers
indeed do convert all of those to their shortest well-formed variant
(which is "") so you cannot rely on their reference length at all.
Servers may at their discretion convert non-ascii characters to their
character reference form (starting with 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-04 Thread Andrew Nenakhov
Upping a year-old email thread for Florian.

Wed, 18 Dec 2019 at 20:41, Marvin W:
>
> [inline]
>
> On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
> > In the end we have settled for counting characters of escaped string, so
>
> This sounds like a terrible idea. In encoded XML, ">", "", ""
> and "]]>" are equivalent. I just tried it out and servers
> indeed do convert all of those to their shortest well-formed variant
> (which is "") so you cannot rely on their reference length at all.
> Servers may at their discretion convert non-ascii characters to their
> character reference form (starting with 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Jonas Schäfer
On Wednesday, 18 December 2019 17:27:04 CET Jonas Schäfer wrote:
> On Wednesday, 18 December 2019 16:40:42 CET Marvin W wrote:
> > [inline]
> > 
> > On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
> > > In the end we have settled for counting characters of escaped string, so
> > 
> > This sounds like a terrible idea. In encoded XML, ">", "", ""
> > and "]]>" are equivalent. I just tried it out and servers
> > indeed do convert all of those to their shortest well-formed variant
> > (which is "") so you cannot rely on their reference length at all.
> > Servers may at their discretion convert non-ascii characters to their
> > character reference form (starting with 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Jonas Schäfer
On Wednesday, 18 December 2019 16:40:42 CET Marvin W wrote:
> [inline]
> 
> On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
> > In the end we have settled for counting characters of escaped string, so
> 
> This sounds like a terrible idea. In encoded XML, ">", "", ""
> and "]]>" are equivalent. I just tried it out and servers
> indeed do convert all of those to their shortest well-formed variant
> (which is "") so you cannot rely on their reference length at all.
> Servers may at their discretion convert non-ascii characters to their
> character reference form (starting with 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
*counting symbols differently.

Sat, 21 Dec 2019 at 17:22, Andrew Nenakhov <
andrew.nenak...@redsolution.com>:

> Sat, 21 Dec 2019 at 17:12, Ralph Meijer:
>
>> So, having unescaped > is valid for case 2, and serializers may choose to
>> do so.
>>
>
> Okay, whatever. We are already counting messages and escaping symbols all
> the time (cause servers to escape them anyway). It's far from being the
> first  thing we do differently. We'll probably change namespace for our
> references to not interfere with whatever you guys come up with.
>
> --
> Andrew Nenakhov
> CEO, redsolution, OÜ
> https://redsolution.com 
>


-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
Sat, 21 Dec 2019 at 17:12, Ralph Meijer:

> So, having unescaped > is valid for case 2, and serializers may choose to
> do so.
>

Okay, whatever. We are already counting messages and escaping symbols all
the time (cause servers to escape them anyway). It's far from being the
first  thing we do differently. We'll probably change namespace for our
references to not interfere with whatever you guys come up with.

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Ralph Meijer
On December 21, 2019 12:32:03 PM GMT+01:00, Andrew Nenakhov 
 wrote:
>Sat, 21 Dec 2019 at 16:21, Ralph Meijer:
>
>> Just making sure everyone has the same interpretation:
>>
>> Case 1) The text has the sequence ]]>. In this case, in XML the > MUST be
>> escaped (with &gt;, or equivalent character reference).
>> Case 2) All occurrences of > not preceded by ]]. Here > MAY appear as-is,
>> or escaped. Both are valid.
>>
>
>Well. We diverge here, and read it differently. The MUST be escaped clause
>uses AND, it is not optional. The reason it MUST be escaped is _for
>compatibility_, and we are in a compatibility game, aren't we?

If this were the case, there'd be no reason for having the 'may' earlier in the 
sentence. The compatibility clause refers to case 1 above. FWIW, it would be 
entirely possible to detect when you're in a CDATA section or not, but the 
authors chose to make it explicit that you must escape  > for this case. I am 
going to assume this is an artifact of XML's SGML ancestry and this rule is to 
make parsing easier.

So, having unescaped > is valid for case 2, and serializers may choose to do so.
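The practical consequence for counting can be shown with Python's standard `xml.etree.ElementTree` module. The sample documents below are made up for illustration: several serializations of different length all parse to the same text, which is why offsets computed on the escaped form are not stable across hops.

```python
import xml.etree.ElementTree as ET

# Five well-formed serializations of the same character data.
variants = [
    "<body>a > b</body>",
    "<body>a &gt; b</body>",
    "<body>a &#62; b</body>",
    "<body>a &#x3E; b</body>",
    "<body><![CDATA[a > b]]></body>",
]

texts = [ET.fromstring(v).text for v in variants]
print(set(texts))                  # a single value: the parsed text "a > b"
print([len(v) for v in variants])  # the serialized lengths all differ

# Re-serializing shows the choice this particular serializer makes.
print(ET.tostring(ET.fromstring(variants[0]), encoding="unicode"))
```

Counting on the parsed text gives the same answer for every variant, regardless of which escaping a server or library happens to choose.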


-- 
Cheers,

ralphm


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Florian Schmaus
On 21.12.19 12:32, Andrew Nenakhov wrote:
> 
> 
> Sat, 21 Dec 2019 at 16:21, Ralph Meijer:
> 
> Just making sure everyone has the same interpretation:
> 
> Case 1) The text has the sequence ]]>. In this case, in XML the >
> MUST be escaped (with &gt;, or equivalent character reference).
> Case 2) All occurrences of > not preceded by ]]. Here > MAY appear
> as-is, or escaped. Both are valid.
> 
> 
> Well. We diverge here, and read it differently. The MUST be escaped clause
> uses AND, it is not optional. The reason it MUST be escaped is _for
> compatibility_, and we are in a compatibility game, aren't we?
> 
> For argument's sake, can you provide examples of XML processing
> libraries that work the way you describe and do not escape > all the
> time? We know none such, and we've tested dozens of them over the many
> years. Every single one always did the escaping. As I think it should,
> because consistency.

Smack does not escape all the time:
https://github.com/igniterealtime/Smack/blob/9d626bf787dc3e0e0a4399cef429285b22744d73/smack-core/src/main/java/org/jivesoftware/smack/util/StringUtils.java#L194

Also xmllint says that '>' in text is well-formed.

$ echo ">" | xmllint --noout -

$ echo "<" | xmllint --noout -
-:1: parser error : StartTag: invalid element name
<
  ^
- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
Sat, 21 Dec 2019 at 16:21, Ralph Meijer:

> Just making sure everyone has the same interpretation:
>
> Case 1) The text has the sequence ]]>. In this case, in XML the > MUST be
> escaped (with &gt;, or equivalent character reference).
> Case 2) All occurrences of > not preceded by ]]. Here > MAY appear as-is,
> or escaped. Both are valid.
>

Well. We diverge here, and read it differently. The MUST be escaped clause
uses AND, it is not optional. The reason it MUST be escaped is _for
compatibility_, and we are in a compatibility game, aren't we?

For argument's sake, can you provide examples of XML processing libraries
that work the way you describe and do not escape > all the time? We know
none such, and we've tested dozens of them over the many years. Every
single one always did the escaping. As I think it should, because
consistency.

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Ralph Meijer
On December 21, 2019 11:57:19 AM GMT+01:00, Andrew Nenakhov 
 wrote:
>On Sat, 21 Dec 2019 at 15:45, Philipp Hörist wrote:
>
>>
>> I think you misunderstood the RFC, it's not a violation to send ">"
>> unescaped.
>>
>> > The right angle bracket (>) *may* be represented using the string
>> > "&gt;", and *MUST*, for compatibility, be escaped using either "&gt;"
>> > or a character reference *when* it appears in the string "]]>" in
>> > content, when that string is not marking the end of a CDATA section.
>>
>>
>I have a different reading of this.
>
>MUST be escaped using
>EITHER &gt;
>OR a character reference (WHEN it appears in the string ...)
>
>so the OR branch is clearly used only for the case listed in WHEN

Just making sure everyone has the same interpretation:

Case 1) The text has the sequence ]]>. In this case, in XML the > MUST be
escaped (with &gt;, or an equivalent character reference).
Case 2) All occurrences of > not preceded by ]]. Here > MAY appear as-is, or
escaped. Both are valid.
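
A minimal sketch of the two cases in Python, assuming a hypothetical `body` element and a hypothetical helper `escape_minimal` that escapes only what XML requires. It is an illustration of the reading above, not a reference serializer.

```python
import xml.etree.ElementTree as ET

# Case 2: three hypothetical wire forms of the same body text "a > b".
wire_forms = [
    "<body>a > b</body>",       # literal '>'
    "<body>a &gt; b</body>",    # predefined entity
    "<body>a &#62; b</body>",   # numeric character reference
]
texts = [ET.fromstring(form).text for form in wire_forms]


def escape_minimal(text: str) -> str:
    """Escape '&' and '<' always, and '>' only inside the sequence ']]>'."""
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    return text.replace("]]>", "]]&gt;")  # Case 1


# Case 1: the '>' in "]]>" is escaped, the other '>' is left as-is.
original = "x]]>y > z"
escaped = escape_minimal(original)
round_trip = ET.fromstring("<body>" + escaped + "</body>").text
print(texts, escaped, round_trip)
```
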

-- 
Cheers,

ralphm


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
On Sat, 21 Dec 2019 at 15:45, Philipp Hörist wrote:

>
> I think you misunderstood the RFC, it's not a violation to send ">"
> unescaped.
>
> > The right angle bracket (>) *may* be represented using the string
> > "&gt;", and *MUST*, for compatibility, be escaped using either "&gt;"
> > or a character reference *when* it appears in the string "]]>" in
> > content, when that string is not marking the end of a CDATA section.
>
>
I have a different reading of this.

MUST be escaped using
EITHER &gt;
OR a character reference (WHEN it appears in the string ...)

so the OR branch is clearly used only for the case listed in WHEN

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Philipp Hörist
On Sat, 21 Dec 2019 at 11:39, Andrew Nenakhov <
andrew.nenak...@redsolution.com> wrote:

>
> We assumed as much but weren't sure. Anyway, Marvin had sent a malformed
> stanza, which was corrected (escaped) by the server. Next, a client that
> counted characters in a different way than he did (which was known
> beforehand) counted them differently.  Next he complains he didn't get the
> result he expected.
>
> The only thing I'm surprised is that the server didn't just drop the
> connection, as it does when receiving unescaped < symbols
>
>
I think you misunderstood the RFC, it's not a violation to send ">"
unescaped.

> The right angle bracket (>) *may* be represented using the string
> "&gt;", and *MUST*, for compatibility, be escaped using either "&gt;"
> or a character reference *when* it appears in the string "]]>" in
> content, when that string is not marking the end of a CDATA section.


Regards
Philipp


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
On Fri, 20 Dec 2019 at 20:34, Dave Cridland wrote:

>
> I think we've just conclusively proven it does get changed during sending.
> We certainly cannot rely on it not being changed, since absolutely nothing
> in XML or XMPP prevents it being changed.
>

If you form the stanza according to the standard before sending, it does
not need changing on the server side. So far we have had exactly zero issues
with our way of counting characters, on any server. We also send a strict XML
document to the server, where references point to positions
within it, not to some abstract representation of the content.


-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
On Sat, 21 Dec 2019 at 14:53, Florian Schmaus wrote:

> On 21.12.19 10:50, Andrew Nenakhov wrote:
> >
> >
> > On Fri, 20 Dec 2019 at 19:25, Marvin W wrote:
> >
> > On 12/20/19 1:15 PM, Andrew Nenakhov wrote:
> > > You have sent a string '>', which was escaped to
> > > '&gt;' before sending to the server.
> >
> > I have sent ">" verbatim (exactly the stanza I send you in the
> last
> > mail was what went (TLS encrypted) to the server. According to XML
> > standard "the ampersand character (&) and the left angle bracket (<)
> > must not appear in their literal form" [1], but nothing is wrong with
> > having > in literal form (if it doesn't appear after "]]" in which
> case
> > it has to be replaced with a reference).
> >
> >
> > Before we proceed any further, could you please clarify what exactly you
> > mean by 'I have sent ... verbatim' ?
>
> This was the string presented to the layers below the XML layers, e.g.
> TLS, to be put on the wire.
>

We assumed as much but weren't sure. Anyway, Marvin sent a malformed
stanza, which was corrected (escaped) by the server. Next, a client that
counts characters in a different way than he does (which was known
beforehand) counted them differently. Now he complains that he didn't get the
result he expected.

The only thing that surprises me is that the server didn't just drop the
connection, as it does when receiving unescaped < symbols.

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Ralph Meijer
On December 21, 2019 10:57:02 AM GMT+01:00, Florian Schmaus  
wrote:
>On 18.12.19 16:00, Marvin W wrote:
>> It's indeed a good question if anything in XMPP allows servers or
>> in-between entities to do normalization. I was under the assumption that
>> servers do not change the codepoints. In XML [1] Characters with
>> multiple possible representations in ISO/IEC 10646 (e.g. characters with
>> both precomposed and base+diacritic forms) match only if they have the
>> same representation in both strings. Thus by XML specification,
>> normalization is changing the body.
>
>I am not sure if it is not a little bit far fetched to deduce from the
>XML "string match" definition that XMPP entities are not provided with a
>little bit of freedom to transform Unicode string representation within
>a certain degree. At least I am currently missing the link from the XML
>"string match" definition to "XMPP entities must use this when
>serializing/de-serializing XML".
>
>If we can make that link, then we do not need normalization. And we
>probably want to clearly state that requirement in rfc6120bis, because
>it is not obvious (at least for me).

I'd be quite sad if the character data were normalized/canonicalized. Also,
I haven't seen this anywhere in XMPP implementations outside of JID matching.


>> Also the main reason why we shouldn't ask for Unicode normalization to
>> happen is that different Unicode version have different normalizations.
>> Thus if the sender normalizes with Unicode version X and calculates
>> offsets from that, then receiver normalizes with Unicode version Y and
>> determines the offsets there, they can end up in pointing to different
>> characters.
>
>We need Unicode agility anyway in XMPP, which I do not believe to be a
>big issue. Especially since Unicode is likely to introduce lesser
>changes with every future standard version.

Quite (except for JIDs). This is also a reason why for example Grapheme Cluster 
counting would bring us a world of pain.


-- 
Cheers,

ralphm


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Florian Schmaus
On 18.12.19 16:00, Marvin W wrote:
> It's indeed a good question if anything in XMPP allows servers or
> in-between entities to do normalization. I was under the assumption that
> servers do not change the codepoints. In XML [1] Characters with
> multiple possible representations in ISO/IEC 10646 (e.g. characters with
> both precomposed and base+diacritic forms) match only if they have the
> same representation in both strings. Thus by XML specification,
> normalization is changing the body.

I am not sure if it is not a little bit far fetched to deduce from the
XML "string match" definition that XMPP entities are not provided with a
little bit of freedom to transform Unicode string representation within
a certain degree. At least I am currently missing the link from the XML
"string match" definition to "XMPP entities must use this when
serializing/de-serializing XML".

If we can make that link, then we do not need normalization. And we
probably want to clearly state that requirement in rfc6120bis, because
it is not obvious (at least for me).

> Also the main reason why we shouldn't ask for Unicode normalization to
> happen is that different Unicode version have different normalizations.
> Thus if the sender normalizes with Unicode version X and calculates
> offsets from that, then receiver normalizes with Unicode version Y and
> determines the offsets there, they can end up in pointing to different
> characters.

We need Unicode agility anyway in XMPP, which I do not believe to be a
big issue. Especially since Unicode is likely to introduce lesser
changes with every future standard version.
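
To make concrete why normalization matters for offsets at all, here is a small Python sketch. The sample body text is an assumption chosen for illustration; it only shows that NFC normalization can shift code point offsets, not what any XMPP entity actually does.

```python
import unicodedata

precomposed = "\u00e9"    # 'é' as a single code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Hypothetical body that a sender computed offsets against.
body_before = "caf" + decomposed + " bar"
# The same body after an intermediary applied NFC normalization.
body_after = unicodedata.normalize("NFC", body_before)

offset_before = body_before.index("bar")
offset_after = body_after.index("bar")
print(len(precomposed), len(decomposed), offset_before, offset_after)
```

An offset computed by the sender against the decomposed form would point one code point too far in the normalized form.
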

- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Florian Schmaus
On 21.12.19 10:50, Andrew Nenakhov wrote:
> 
> 
> On Fri, 20 Dec 2019 at 19:25, Marvin W wrote:
> 
> On 12/20/19 1:15 PM, Andrew Nenakhov wrote:
> > You have sent a string '>', which was escaped to
> > '&gt;' before sending to the server.
> 
> I have sent ">" verbatim (exactly the stanza I send you in the last
> mail was what went (TLS encrypted) to the server. According to XML
> standard "the ampersand character (&) and the left angle bracket (<)
> must not appear in their literal form" [1], but nothing is wrong with
> having > in literal form (if it doesn't appear after "]]" in which case
> it has to be replaced with a reference).
> 
> 
> Before we proceed any further, could you please clarify what exactly you
> mean by 'I have sent ... verbatim' ? 

This was the string presented to the layers below the XML layers, e.g.
TLS, to be put on the wire.

- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Andrew Nenakhov
On Fri, 20 Dec 2019 at 19:25, Marvin W wrote:

> On 12/20/19 1:15 PM, Andrew Nenakhov wrote:
> > You have sent a string '>', which was escaped to
> > '&gt;' before sending to the server.
>
> I have sent ">" verbatim (exactly the stanza I send you in the last
> mail was what went (TLS encrypted) to the server. According to XML
> standard "the ampersand character (&) and the left angle bracket (<)
> must not appear in their literal form" [1], but nothing is wrong with
> having > in literal form (if it doesn't appear after "]]" in which case
> it has to be replaced with a reference).
>

Before we proceed any further, could you please clarify what exactly you
mean by 'I have sent ... verbatim' ?

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-21 Thread Florian Schmaus
On 18.12.19 15:22, Andrew Nenakhov wrote:
> We're totally onboard with this XEP, and it is, in fact, the way we
> already do count characters for references in all versions of Xabber.
> 
> However, there is one important case not addressed in this XEP: XML
> predefined entities.

As others have already pointed out, this happens on a different, lower
layer. The XML wire format is not relevant here; what is relevant is the
string your XML parser outputs.

- Florian





Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Tedd Sterr
A few thoughts…

If we consider character-range indices as referring to the points between 
characters, not the positions of the characters themselves, then there's no 
confusion over whether a character should be included - a character is either 
inside the range or outside of it.
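
As an illustration of the "points between characters" view, Python slicing already behaves this way. The helper name `select` and the sample string are assumptions for the sketch.

```python
def select(text: str, begin: int, end: int) -> str:
    """Return the characters lying between point `begin` and point `end`."""
    return text[begin:end]


text = "ABCDE"
empty = select(text, 2, 2)      # same point twice: nothing selected
single = select(text, 2, 3)     # exactly one character
left = select(text, 0, 2)
right = select(text, 2, 5)      # adjacent ranges share a boundary point
print(repr(empty), single, left + right)
```

With this convention two adjacent ranges never overlap and never leave a gap, and an empty range is expressible.
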

XML is a representation of the document and its content, it is not the content 
itself; similarly, UTF-8 is a representation of the text, not the text itself - 
in both cases, if you want the content then you must decode the representation 
first. This means references to content must be made regarding the decoded 
version.
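
A short Python sketch of the difference between counting in the representation and counting in the decoded content. The `body` element and its text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical wire bytes: UTF-8 encoded XML with one escaped '<'.
wire = "<body>caf\u00e9 &lt; th\u00e9</body>".encode("utf-8")

# Decoding both layers (UTF-8, then XML) yields the actual content.
text = ET.fromstring(wire).text

# The raw payload between the tags, still in its encoded representation.
payload = wire[len(b"<body>"):-len(b"</body>")]

print(len(payload), len(text.encode("utf-8")), len(text))
```

The three numbers differ: bytes of escaped XML, bytes of the decoded text in UTF-8, and code points of the decoded text.
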

For messages of the form "/me …", references must be to this version, before 
any further transformations (such as inserting the nickname.) There's no 
guarantee that the nickname you intend is the same as the one which will be 
used, and could thus have a different length, e.g. the recipient has set a 
custom nickname for you, or you send that message while the recipient is 
offline and then change your nickname before it's received.



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Dave Cridland
On Fri, 20 Dec 2019 at 14:43, Andrew Nenakhov <
andrew.nenak...@redsolution.com> wrote:

>
>
> On Fri, 20 Dec 2019 at 17:53, Dave Cridland wrote:
>
>>
>>
>> On Fri, 20 Dec 2019 at 12:15, Andrew Nenakhov <
>> andrew.nenak...@redsolution.com> wrote:
>>
>>> You have sent a string '>', which was escaped to
>>> '&gt;' before sending to the server.
>>>
>>
>> Well, maybe. XML doesn't require you to escape '>' in text, only in
>> attribute values.
>>
>
> I must be using different XML from you. Documentation for the version we
> are using is here: https://www.w3.org/TR/REC-xml/#syntax
>
> Quote:
>
>> The ampersand character (&) and the left angle bracket (<) *MUST NOT*
>> appear in their literal form, except when used as markup delimiters, or
>> within a comment, a processing instruction, or a CDATA section. If they
>> are needed elsewhere, they *MUST* be escaped using either numeric
>> character references or the strings "&amp;" and "&lt;" respectively.
>> The right angle bracket (>) may be represented using the string "&gt;",
>> and *MUST*, for compatibility, be escaped using either "&gt;" or a
>> character reference when it appears in the string "]]>" in content, when
>> that string is not marking the end of a CDATA section.
>>
>
> I don't see any exceptions that allow '>' in XML.
>
>
Unless I'm missing something obvious, this says you can use > unescaped
everywhere except for the explicit case of "]]>". So again, ">" as text
content can be sent as-is, and requires no escaping. (But I was wrong about
the attribute value case, where you can, in fact, send ">" unescaped
also).
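
A quick way to check both places with Python's standard parser. The element name `x` and the attribute name `note` are assumptions used only for this sketch.

```python
import xml.etree.ElementTree as ET

# Literal '>' both in an attribute value and in text content.
element = ET.fromstring('<x note="a > b">c > d</x>')

attribute_value = element.get("note")
text_content = element.text
print(attribute_value, text_content)
```
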


>
>> Presumably, in order to calculate the referencing, one would need to know
>> precisely how this string was to be serialized? Does that mean it needs
>> to... what? Hardcode that knowledge based on the library used? Seems
>> astonishingly fragile, especially if you're working in an environment where
>> the XML serialization is provided by the platform. Like a web browser.
>>
>
> So far we managed it rather well on four different platforms with five
> languages. This way we have precise references to resulting stanza text.
> Not some 'ideal' 'abstract' unicode string, but to a formed piece of XML
> document, that's not going to be changed or modified anymore before
> sending. This is the most stable way to solve this problem.
>

I think we've just conclusively proven it does get changed during sending.
We certainly cannot rely on it not being changed, since absolutely nothing
in XML or XMPP prevents it being changed.
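
A minimal sketch of how an ordinary parse-and-reserialize step changes the wire form while leaving the text untouched. It uses Python's `xml.etree.ElementTree` as a stand-in for whatever an intermediary uses; that choice and the `body` element are assumptions.

```python
import xml.etree.ElementTree as ET

sent = "<body>></body>"                  # literal '>' on the wire
element = ET.fromstring(sent)            # what an intermediary parses
reserialized = ET.tostring(element, encoding="unicode")

text_before = element.text
text_after = ET.fromstring(reserialized).text
print(sent, reserialized, text_before == text_after)
```

Offsets computed against the serialized form break here (1 character became 4), while offsets computed against the parsed text survive.
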

Dave.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Andrew Nenakhov
On Fri, 20 Dec 2019 at 17:53, Dave Cridland wrote:

>
>
> On Fri, 20 Dec 2019 at 12:15, Andrew Nenakhov <
> andrew.nenak...@redsolution.com> wrote:
>
>> You have sent a string '>', which was escaped to
>> '&gt;' before sending to the server.
>>
>
> Well, maybe. XML doesn't require you to escape '>' in text, only in
> attribute values.
>

I must be using different XML from you. Documentation for the version we
are using is here: https://www.w3.org/TR/REC-xml/#syntax

Quote:

> The ampersand character (&) and the left angle bracket (<) *MUST NOT*
> appear in their literal form, except when used as markup delimiters, or
> within a comment, a processing instruction, or a CDATA section. If they
> are needed elsewhere, they *MUST* be escaped using either numeric
> character references or the strings "&amp;" and "&lt;" respectively.
> The right angle bracket (>) may be represented using the string "&gt;",
> and *MUST*, for compatibility, be escaped using either "&gt;" or a
> character reference when it appears in the string "]]>" in content, when
> that string is not marking the end of a CDATA section.
>

I don't see any exceptions that allow '>' in XML.


> Presumably, in order to calculate the referencing, one would need to know
> precisely how this string was to be serialized? Does that mean it needs
> to... what? Hardcode that knowledge based on the library used? Seems
> astonishingly fragile, especially if you're working in an environment where
> the XML serialization is provided by the platform. Like a web browser.
>

So far we have managed it rather well on four different platforms with five
languages. This way we have precise references to the resulting stanza text:
not to some 'ideal' 'abstract' Unicode string, but to a formed piece of an XML
document that is not going to be changed or modified any more before
sending. This is the most stable way to solve this problem.


-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Marvin W

On 12/20/19 1:15 PM, Andrew Nenakhov wrote:
> You have sent a string '>', which was escaped to
> '&gt;' before sending to the server.


I sent ">" verbatim (exactly the stanza I sent you in the last
mail is what went, TLS-encrypted, to the server). According to the XML
standard, "the ampersand character (&) and the left angle bracket (<)
must not appear in their literal form" [1], but nothing is wrong with
having > in literal form (unless it appears after "]]", in which case
it has to be replaced with a reference).


Apparently either your server or your client silently replaced the
character with a reference (I could probably do the same in the other
direction). I also think this is completely fine, because changing ">"
to "&gt;" does not change the XML document - again, they are the same in
XML, so they should be the same in XMPP as well.


> To me, it works as designed - a sending entity had sent an incorrect
> reference and predictably Xabber for Web worked displaying it as it should.


I totally understand why this happened (I intentionally produced this,
because I know that many XML serializers do indeed serialize ">" as
"&gt;" even when it is not required).


The underlying reason why this happened is that your "standard" has 
flaws. And I wrote this ProtoXEP to ensure there is one source of truth 
regarding character counting so that such flaws don't happen again. I 
will certainly update it to make sure everyone understands that ">" is 
to be counted as 1 character and not 4.


> It is true, we're not really good at
> writing formal XEPs, in part because we're extremely busy building real
> products that work.


I wrote this ProtoXEP because I wanted to build real products and felt
that this needed to be clarified. We need formal XEPs so that a real
product is actually compatible with other real products in the same
federated network and they do not cause issues for each other. If they are
not, they at most qualify as a real broken product.


[1] https://www.w3.org/TR/REC-xml/#syntax


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Dave Cridland
On Fri, 20 Dec 2019 at 12:15, Andrew Nenakhov <
andrew.nenak...@redsolution.com> wrote:

> You have sent a string '>', which was escaped to
> '&gt;' before sending to the server.
>

Well, maybe. XML doesn't require you to escape '>' in text, only in
attribute values.

Presumably, in order to calculate the referencing, one would need to know
precisely how this string was to be serialized? Does that mean it needs
to... what? Hardcode that knowledge based on the library used? Seems
astonishingly fragile, especially if you're working in an environment where
the XML serialization is provided by the platform. Like a web browser.

Dave


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Andrew Nenakhov
On Fri, 20 Dec 2019 at 00:49, Marvin W wrote:

> So I tried with Xabber/xabber.org and either your server or the client
> (I guess it's the server) seems to fail to properly do what you just
> said it should: When sending the message
>
> 
>>
> type='markup'>
> type='markup'>
> 
>
> it is displayed as
>
> 
>
> with g and ; in bold.
>

Let's see what happened (btw, xabber.org currently uses a stock ejabberd
server):

You have sent a string '>', which was escaped to '&gt;'
before sending to the server.
Xabber for Web (you weren't really clear about what you used, but I assume
that was it) then took the string '&gt;' and applied references to
it, turning it into this:
'>'

Then, it was correctly rendered like this (I have highlighted the bold
characters for better visibility):
&*g*t*;*
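
The effect described above can be reproduced with a small Python sketch. The ranges below are hypothetical (the ones in the original stanza are not shown here), they are treated as half-open for simplicity, and `apply_bold` is an illustrative helper, not Xabber's code.

```python
def apply_bold(text: str, ranges: list[tuple[int, int]]) -> str:
    """Mark each half-open [begin, end) range of `text` with asterisks."""
    pieces = []
    position = 0
    for begin, end in sorted(ranges):
        pieces.append(text[position:begin])
        chunk = text[begin:end]
        if chunk:  # ranges beyond the end of the text select nothing
            pieces.append("*" + chunk + "*")
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


# Hypothetical references, computed against some other representation.
ranges = [(1, 2), (3, 4)]

on_escaped = apply_bold("&gt;", ranges)  # counting in the escaped XML text
on_parsed = apply_bold(">", ranges)      # counting in the parsed text
print(on_escaped, on_parsed)
```

The same two references land on different characters depending on which representation the receiver counts in, which is the disagreement in this thread.
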

To me, it works as designed - a sending entity had sent an incorrect
reference and predictably Xabber for Web worked displaying it as it should.


> I guess we have different definitions of a standard. These mish-mash of
> different XEPs is a publicly viewable standard proposal. I am not aware
> of a documentation of what Xabber is doing
>

We have good enough internal docs. It is true, we're not really good at
writing formal XEPs, in part because we're extremely busy building real
products that work.

> Well. I strongly object.
>
> Either we need to change the text in XEP-372 slightly or we have to
> change the examples in XEP-372 and the text and examples in XEP-394
> (because both should do the same). I see you have a strong opinion on
> the one side for some reason.
>

XEP-394 does not even use the same semantics that XEP-372 uses, so I would
not even call them related.

> Sure, we could deprecate XEP-394, but I don't see a proper replacement
> for it yet.


I've sent our rather complete proposal (sans formal text, just stanzas) to
this list somewhere around summer.

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Ralph Meijer


On 20-12-2019 12:55, Andrew Nenakhov wrote:



> On Thu, 19 Dec 2019 at 19:02, Ralph Meijer wrote:
>
>> If you want consistent counting on all platforms and languages,
>> counting Unicode characters seems to be the best way forward.
>
> We do not dispute that 'counting unicode characters seems the best way
> forward'. However, we do dispute when to count them. It's more of a
> preference issue, but we chose to count characters in the XML doc we
> send, because XML standard is common for any platform and language.


Just to be clear: an XML stream is encoded in UTF-8 and has additional
processing (like entities) to represent a text. While that series of
UTF-8 encoded characters itself also represents a sequence of
Unicode characters (let's call it seq1), that sequence is not
necessarily equivalent to the abstract sequence of characters that
represents the above-mentioned text (seq2).


Counting in seq1 and counting in seq2 are different things as soon as there
are CDATA sections, entities, etc., and I consider counting seq1 to be the
wrong approach. I.e. I expect the character count for the text in the body
element of the following equivalent XML snippets to be exactly 1 (the
sequence containing the single character U+003C), and not 4, 5, 9, or
13, regardless of where you choose to count:
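
A minimal Python sketch of this point, assuming four hypothetical encodings of the same one-character body. They are chosen to match the lengths 4, 5, 9 and 13 and are illustrative, not necessarily the snippets from this message.

```python
import xml.etree.ElementTree as ET

# Four assumed, equivalent ways to write a body containing only U+003C.
snippets = [
    "<body>&lt;</body>",
    "<body>&#60;</body>",
    "<body>&#x0003c;</body>",
    "<body><![CDATA[<]]></body>",
]

raw_lengths = []   # counting in seq1 (the serialized form)
texts = []         # the abstract text, seq2
for snippet in snippets:
    raw = snippet[len("<body>"):-len("</body>")]
    raw_lengths.append(len(raw))
    texts.append(ET.fromstring(snippet).text)

print(raw_lengths, [len(text) for text in texts])
```
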


  
  
  
  

--
ralphm


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-20 Thread Ralph Meijer

Oops, the following should have been sent to the list.

On 19-12-2019 15:02, Ralph Meijer wrote:



On 19-12-2019 13:59, Andrew Nenakhov wrote:
On Wed, 18 Dec 2019 at 20:12, Ralph Meijer wrote:


    My assumption was that we are looking at character data on the
    abstract layer /after/ parsing XML. You shouldn't see entities there
    (they'd be resolved to their respective characters), nor should you
    see , and request the text for 
the `blah` node, I get an object that encodes the abstract sequence of 
characters: `less < more`. In Python, for example, that'd be 
represented by a unicode string object.
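
As a sketch of what "the text of the node" means in practice, assuming a hypothetical `blah` element written once with an entity and once with a CDATA section (Python 3, where `str` is the Unicode string type):

```python
import xml.etree.ElementTree as ET

# Two assumed serializations of the same abstract text.
with_entity = ET.fromstring("<blah>less &lt; more</blah>").text
with_cdata = ET.fromstring("<blah><![CDATA[less < more]]></blah>").text

# Both are plain str objects with no trace of the escaping used on the wire.
print(type(with_entity).__name__, with_entity, len(with_entity))
```
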


See also https://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G2212 
for various definitions around characters, code points, glyphs, 
graphemes, and the like. So yes, you'd be counting ZWJs and such for 
your example, and I think it tallies up to 7 for just man/man/boy/boy, 
without Fitzpatrick modifiers, hair variations, hair color, direction.
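
The tally can be checked in Python, where `len()` on a `str` counts code points. The sequence below is the man/man/boy/boy family joined with ZWJ (U+200D); the UTF-8 and UTF-16 figures are added for comparison only.

```python
MAN = "\U0001F468"
BOY = "\U0001F466"
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# One visible glyph on most platforms, built from seven code points.
family = ZWJ.join([MAN, MAN, BOY, BOY])

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2
print(code_points, utf8_bytes, utf16_units)
```
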


With regard to having to re-encode for HTML representation: unfortunate
as that may be, other situations require other
transformations, such as encoding in UTF-8, for the text to be used in other
systems (UI, storage, etc.).


If you want consistent counting on all platforms and languages, 
counting Unicode characters seems to be the best way forward.



--
ralphm



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-19 Thread Marvin W

On 12/19/19 1:59 PM, Andrew Nenakhov wrote:

> Is it really any better than escaped XML text?


Yes. Any sane implementation of XML parsers would resolve references as 
part of the parsing, so you would have to do extra work to find out what 
references were in the text before.


> Plus, when doing the web client this means an additional
> escaping - deescaping routine every time when something is
> sent-displayed, cause browsers require their own escaping.


I hope that any web client would not use innerHTML or similar techniques
to display the message body, but instead rely on
document.createTextNode(), which expects a string without references.
Similarly, inputElement.value and element.textContent give you their
strings without references. In general, HTML/JS do their best to
abstract away from references, because why should an application
developer deal with that?


Also, HTML uses a different set of predefined references than XML and has
different requirements -  is valid in HTML but not in XML (without
it being defined as an entity in a DTD).
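
A small Python sketch of that difference, using `&nbsp;` as an assumed example of a reference that HTML predefines and XML does not:

```python
import html
import xml.etree.ElementTree as ET

# HTML knows &nbsp; as a named character reference (U+00A0).
html_text = html.unescape("a&nbsp;b")

# XML predefines only lt, gt, amp, apos and quot; without a DTD
# declaring it, &nbsp; is an undefined entity and parsing fails.
try:
    ET.fromstring("<body>a&nbsp;b</body>")
    xml_accepts_nbsp = True
except ET.ParseError:
    xml_accepts_nbsp = False

print(repr(html_text), xml_accepts_nbsp)
```
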


> Why should standard be concerned about different server implementations
> converting anything?  If a server does some converting for some reason
> from one way of escaping XML to another, of course it should recalculate
> all references.


On the XML layer (which is what XMPP builds on) this "conversion" does
not change anything (the texts stay the same), which is why it is
perfectly valid for a server to do it. The protocol on top of XML (and
subsequently XMPP) should not deal with references; they are resolved on
the layer below. That's why it is a bad idea to assume specific
characters to be represented using certain references: you can't
control that (you can only assume things).


So I tried with Xabber/xabber.org and either your server or the client 
(I guess it's the server) seems to fail to properly do what you just 
said it should: When sending the message



  >
  type='markup'>
  type='markup'>



it is displayed as



with g and ; in bold.


> So far our 'non-standard' way of using
> references is in fact way more 'standard' than what is currently
> suggested by this mish-mash of different XEPs.


I guess we have different definitions of a standard. This mish-mash of
different XEPs is a publicly viewable standard proposal. I am not aware
of any documentation of what Xabber is doing.


> Not really cool, right?


What's bad about that? I would say that having "0..0 bold" is pretty 
weird, because it sounds like an empty range (it starts and ends at the 
same point, so it must be empty).




> The second integer represents the location of the first non-URL
> character occurring after the URL *(or the end of the string if the
> URL is the last part of the Tweet text)*



I think you are misunderstanding them here. I am pretty sure "the end of 
the string" is *after* the last character, not the last character.


> Cited example of programming languages is valid only in part. Yes, it is
> so in java or python, but not so in swift, obj-c or erlang. The last
> three use index of the first character and length, which is  actually my
> favourite approach.


I don't think it really makes sense to discuss which programming 
language is the one that matters most, but:

- Swift has two operators "ABCDE"[2...4] = "CDE" and "ABCDE"[2..<4] = "CD"
- Objective-C substring functions require index and length
- Erlang uses 1-based indices, string:sub_string("ABCDE", 2, 4) = "BCD", 
thus is equivalent to python [1:4]


Also, when you prefer index of first char and length, why not use
begin="2" length="2" then? For languages that take string length, you
currently have to calculate length = end+1-begin (because you chose to
have end one less than everyone else does).
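
The conversions between the conventions discussed above can be sketched in Python. All helper names here are hypothetical; the Erlang helper only mimics the 1-based, inclusive behaviour of `string:sub_string/3` as described in the list.

```python
def half_open_to_inclusive(begin: int, end: int) -> tuple[int, int]:
    """(begin, end-exclusive) -> (begin, index of the last character)."""
    return begin, end - 1


def inclusive_to_length(begin: int, last: int) -> tuple[int, int]:
    """(begin, index of the last character) -> (begin, length)."""
    return begin, last + 1 - begin


def erlang_sub_string(text: str, start: int, stop: int) -> str:
    """Mimic a 1-based, inclusive substring call using Python slicing."""
    return text[start - 1:stop]


text = "ABCDE"
half_open = text[2:4]                                # end is exclusive
begin, last = half_open_to_inclusive(2, 4)           # inclusive end
index, length = inclusive_to_length(begin, last)     # index + length
by_length = text[index:index + length]
erlang_like = erlang_sub_string(text, 2, 4)
print(half_open, (begin, last), (index, length), by_length, erlang_like)
```

With an exclusive end, the length is simply `end - begin`; the inclusive form needs the extra `+ 1`.
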




> On Wed, 18 Dec 2019 at 21:59, Marvin W wrote:
>
>> I don't think it really is a "change", in XEP-394 it is already defined
>> this way ("the last affected codepoint is the one just before end" [1])
>> and the example in XEP-372 [2] also counts that way (char 72 is the "J"
>> of "Juliet" and char 78 is the space after "Juliet"). Only the text
>> misleadingly says "An end attribute is similarly used for the index of
>> the last character of the reference.", so this may need a clarification.
>
> Well. I strongly object.


Either we need to change the text in XEP-372 slightly or we have to 
change the examples in XEP-372 and the text and examples in XEP-394 
(because both should do the same). I see you have a strong opinion on 
the one side for some reason.



( Btw, did anyone but us implement this XEP at all?  )


Converse has an implementation of XEP-372 for mentions (the only usecase 
that is properly defined in that XEP IMO).


On the 'already defined' XEP-0394: as we have learned from the XEP-0071 
debacle, even widely implemented XEPs can be deprecated with vague 
reasoning, so deprecating a contradictory XEP that, to my knowledge, wasn't 
even implemented anywhere, shouldn't be too much 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-19 Thread Andrew Nenakhov
On Wed, 18 Dec 2019 at 20:12, Ralph Meijer wrote:

> My assumption was that we are looking at character data on the abstract
> layer /after/ parsing XML. You shouldn't see entities there (they'd be
> resolved to their respective characters), nor should you see  wrappers.
>
Hm, please define the 'abstract' layer more precisely. Taking the example
from the XEP proposal, which is the true abstract layer: this,
[image: image.png], or this: [image: image.png]? Or the layer with
'codepoints'? Is it really any better than escaped XML text?

This approach is also not very practical. When you do stanza processing on
a server, most often you just take the stanza as is and pass all reference
data along without converting it to the abstract layer and back. Plus, in
a web client this means an additional escaping/unescaping routine every
time something is sent or displayed, because browsers require their own
escaping.

On Wed, 18 Dec 2019 at 20:41, Marvin W wrote:

> [inline]
>
> On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
> > In the end we have settled for counting characters of escaped string, so
>
> This sounds like a terrible idea. In encoded XML, ">", "", ""
> and "]]>" are equivalent. I just tried it out and servers
> indeed do convert all of those to their shortest well-formed variant
> (which is "") so you cannot rely on their reference length at all.
> Servers may at their discretion convert non-ascii characters to their
> character reference form (starting with 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Marvin W

On 12/18/19 5:00 PM, Ralph Meijer wrote:
I'd not be opposed to changing the definition of 'end' here. Twitter 
Entities [1] also points to the character after.


I don't think it really is a "change", in XEP-394 it is already defined 
this way ("the last affected codepoint is the one just before end" [1]) 
and the example in XEP-372 [2] also counts that way (char 72 is the "J" 
of and char 78 is the space after "Juliet"). Only the text misleadingly 
says "An end attribute is similarly used for the index of the last 
character of the reference.", so this may need a clarification.


[1] https://xmpp.org/extensions/xep-0394.html#usecases-inline
[2] https://xmpp.org/extensions/xep-0372.html#example-3


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Ralph Meijer

On 18-12-2019 16:40, Marvin W wrote:

[..]

Also that's a weird counting there, usually I would expect end to 
point to the position after the last referenced character - at least 
that's what you do in most programming languages (e.g. 
""[0:14] will give you "" without the 
last ";").


I'd not be opposed to changing the definition of 'end' here. Twitter 
Entities [1] also points to the character after.


[1] 
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object


--
ralphm



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Marvin W

[inline]

On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
In the end we have settled for counting characters of escaped string, so 


This sounds like a terrible idea. In encoded XML, ">", "", "" 
and "]]>" are equivalent. I just tried it out and servers 
indeed do convert all of those to their shortest well-formed variant 
(which is "") so you cannot rely on their reference length at all. 
Servers may at their discretion convert non-ascii characters to their 
character reference form (starting with 
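
The equivalence argument can be checked with a small Python sketch. It assumes a hypothetical bare <body> element and uses the standard library's xml.etree.ElementTree parser; the five spellings below are illustrative forms of the single character ">" and are not necessarily the exact ones listed in this message.

```python
import xml.etree.ElementTree as ET

# Five serializations of a body whose parsed text is the single character ">".
variants = [
    "<body>></body>",
    "<body>&gt;</body>",
    "<body>&#62;</body>",
    "<body>&#x3E;</body>",
    "<body><![CDATA[>]]></body>",
]

# After parsing, every variant yields the same character data.
parsed = [ET.fromstring(v).text for v in variants]

# On the wire, the same content takes a different number of characters each time.
wire_lengths = [len(v) - len("<body></body>") for v in variants]

for v, text, n in zip(variants, parsed, wire_lengths):
    print(f"{v!r}: parsed={text!r}, escaped length={n}")
```

All five parse to a one-character body, while the escaped length varies, so offsets computed on the escaped form would not survive a server that re-serializes the stanza.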

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Ralph Meijer
My assumption was that we are looking at character data on the abstract 
layer /after/ parsing XML. You shouldn't see entities there (they'd be 
resolved to their respective characters), nor should you see 

Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Marvin W
It's indeed a good question whether anything in XMPP allows servers or 
in-between entities to do normalization. I was under the assumption that 
servers do not change the codepoints. According to XML [1], "Characters 
with multiple possible representations in ISO/IEC 10646 (e.g. characters 
with both precomposed and base+diacritic forms) match only if they have 
the same representation in both strings." Thus, by the XML specification, 
normalization changes the body.


Also, the main reason why we shouldn't ask for Unicode normalization to 
happen is that different Unicode versions have different normalizations. 
Thus if the sender normalizes with Unicode version X and calculates 
offsets from that, and the receiver then normalizes with Unicode version Y 
and determines the offsets there, they can end up pointing to different 
characters.


[1] https://www.w3.org/TR/REC-xml/#dt-match

On 12/18/19 11:59 AM, Florian Schmaus wrote:

But I wonder if we
shouldn't require Unicode normalization, i.e. the sender and receiver
MUST normalize prior to counting.

Given that nothing in XMPP guarantees you that the Unicode is not
transformed somewhere in the stanza processing and routing, e.g. gets
combined, this would be required so that sender and receiver operate on
the same Unicode data.

And I believe that there could be cases where such transformations
actually really happen, e.g. message archives which persist the Unicode
data in combined form for efficiency reasons.



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Andrew Nenakhov
We're totally on board with this XEP, and it is, in fact, the way we already
count characters for references in all versions of Xabber.

However, there is one important case not addressed in this XEP: XML
predefined entities.

Symbols that are to be escaped, as listed in
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
are:

& -- &amp;
< -- &lt;
> -- &gt;
" -- &quot;
' -- &apos;

Counting symbols gives different results depending on whether we count
characters before or after unescaping (from experience, without an explicit
mention of this problem, our developers split exactly 50/50 on this: 2
developers counted before unescaping and 2 after).

In the end we have settled on counting characters of the escaped string, so
to draw *&&&* in a client we count it as a string with a length of 15, thus
the reference points to characters 0..14.
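
The difference between the two counts can be reproduced with a short Python sketch. It uses escape and unescape from the standard library's xml.sax.saxutils as a stand-in for whatever escaping a client actually applies, which is an assumption and not a description of Xabber's code.

```python
from xml.sax.saxutils import escape, unescape

body = "&&&"                 # the text as the user sees it
escaped = escape(body)       # the text as it appears inside the XML stanza

length_unescaped = len(body)     # counted after unescaping
length_escaped = len(escaped)    # counted before unescaping

# Round trip: unescaping the wire form gives back the displayed text.
round_trip = unescape(escaped)

print(repr(escaped), length_escaped, length_unescaped)
```

The escaped form is 15 characters long and the displayed text is 3, so two clients that do not agree on which form to count will disagree on every offset that follows an escaped character.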



On Tue, 17 Dec 2019 at 16:19,  wrote:

> The XMPP Extensions Editor has received a proposal for a new XEP.
>
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
>
> URL: https://xmpp.org/extensions/inbox/charcount.html
>
> The Council will decide in the next two weeks whether to accept this
> proposal as an official XEP.
>


-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com 


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Philipp Hörist
On Wed, 18 Dec 2019 at 12:02, Florian Schmaus wrote:

> I do like to point out that it is probably not really XMPP specific
> (similar to XEP-0392: Consistent Color Generation), but I don't see a
> reason why this shouldn't get XEP'ed up.
>
>
I don't see the similarities. One (XEP-0392) is a pure UI suggestion and
has nothing to do with the protocol layer.
The other XEP (Message Counting) tells you how to generate information
which you have to transfer via XMPP to other entities.

One has no impact whatsoever on interoperability.
Without the other there is simply no interoperability.

Not sure what you mean by XMPP specific. Yes, I'm sure other things in this
world also count characters for something, but that still leaves us
with the responsibility to define how XMPP wants to do it.

Regards
Philipp


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-18 Thread Florian Schmaus
On 12/17/19 12:18 PM, p...@bouah.net wrote:
> The XMPP Extensions Editor has received a proposal for a new XEP.
> 
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
> 
> URL: https://xmpp.org/extensions/inbox/charcount.html

As others already said, that is something we need. So thanks Marvin for
submitting this.

I do like to point out that it is probably not really XMPP specific
(similar to XEP-0392: Consistent Color Generation), but I don't see a
reason why this shouldn't get XEP'ed up.

Codepoints as the unit had been my first choice too. But I wonder if we
shouldn't require Unicode normalization, i.e. the sender and receiver
MUST normalize prior to counting.

Given that nothing in XMPP guarantees you that the Unicode is not
transformed somewhere in the stanza processing and routing, e.g. gets
combined, this would be required so that sender and receiver operate on
the same Unicode data.

And I believe that there could be cases where such transformations
actually really happen, e.g. message archives which persist the Unicode
data in combined form for efficiency reasons.
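
A minimal Python sketch of the concern, using the standard library's unicodedata module. The sample body and the idea that an intermediary recombines the text are assumptions made up for illustration.

```python
import unicodedata

# Sender composes the body in decomposed form: "e" followed by U+0301.
body_sent = "cafe\u0301 au lait"

# Hypothetical intermediary (e.g. an archive) stores the combined form.
body_stored = unicodedata.normalize("NFC", body_sent)

# Codepoint offset of "au", as computed by the sender and as seen by the receiver.
begin_sent = body_sent.index("au")
begin_stored = body_stored.index("au")

print(len(body_sent), len(body_stored), begin_sent, begin_stored)
```

The two strings render identically, but the combined form is one codepoint shorter, so an offset computed by the sender points one position too far in the stored copy.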

- Florian


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-17 Thread Lance Stout
An additional reference from XEP-0301 (In-Band Real Time Text) in support of 
this:

https://xmpp.org/extensions/xep-0301.html#unicode_character_counting



> On Dec 17, 2019, at 3:18 AM, p...@bouah.net wrote:
> 
> The XMPP Extensions Editor has received a proposal for a new XEP.
> 
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
> 
> URL: https://xmpp.org/extensions/inbox/charcount.html
> 
> The Council will decide in the next two weeks whether to accept this
> proposal as an official XEP.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-17 Thread Jonas Schäfer
On Tuesday, 17 December 2019 12:18:53 CET p...@bouah.net wrote:
> The XMPP Extensions Editor has received a proposal for a new XEP.
> 
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
> 
> URL: https://xmpp.org/extensions/inbox/charcount.html
> 
> The Council will decide in the next two weeks whether to accept this
> proposal as an official XEP.

I am firmly +1 on this.

The only thing I consider worth adding is a note that codepoints are the 
foundation of XML and thus match the XML data model nicely, which is another 
point in favour of using them.

kind regards,
Jonas



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-17 Thread Kevin Smith
I don’t have feedback to give at the moment, but this is a thing we’ve needed 
for a long time, so a big thank you to Marvin for getting something submitted.

/K

> On 17 Dec 2019, at 11:18, p...@bouah.net wrote:
> 
> The XMPP Extensions Editor has received a proposal for a new XEP.
> 
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
> 
> URL: https://xmpp.org/extensions/inbox/charcount.html
> 
> The Council will decide in the next two weeks whether to accept this
> proposal as an official XEP.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-17 Thread Guus der Kinderen
This XEP might want to add an implementation note that relates to
https://xmpp.org/extensions/xep-0245.html. When XEP-0245 is used, clients
often display a different representation of the message from what's in the
body (e.g. replacing "/me" with a nickname). This makes it very easy to make
mistakes when calculating a character-count based offset (e.g. to identify
the position of a mention), as a nickname is likely to have a different
character count than the three in "/me". It's kind of a silly mistake, but
an easy one to make. Guarding against it with a note in this XEP can
help.
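
A short Python sketch of that pitfall. The nickname, body text and mention are made-up sample values, and the display rule (replace a leading "/me" with the nickname) is a simplification of what clients do.

```python
body = "/me waves at juliet"     # what is actually sent in the message body
nick = "romeo"                   # hypothetical sender nickname

# What a client might render for a XEP-0245 style message.
displayed = nick + body[len("/me"):]

mention = "juliet"
begin_in_body = body.index(mention)            # offset that belongs in the stanza
begin_in_display = displayed.index(mention)    # offset in the rendered text

# The two offsets differ by the length difference between nick and "/me".
shift = begin_in_display - begin_in_body

print(begin_in_body, begin_in_display, shift)
```

Offsets have to be computed against the body as sent, not against the rendered line, otherwise every reference is shifted by len(nick) - 3.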

On Tue, 17 Dec 2019 at 12:20,  wrote:

> The XMPP Extensions Editor has received a proposal for a new XEP.
>
> Title: Character counting in message bodies
> Abstract:
> This document describes how to correctly count characters in message
> bodies. This is required when referencing a position in the body.
>
> URL: https://xmpp.org/extensions/inbox/charcount.html
>
> The Council will decide in the next two weeks whether to accept this
> proposal as an official XEP.


[Standards] Proposed XMPP Extension: Character counting in message bodies

2019-12-17 Thread pep
The XMPP Extensions Editor has received a proposal for a new XEP.

Title: Character counting in message bodies
Abstract:
This document describes how to correctly count characters in message
bodies. This is required when referencing a position in the body.

URL: https://xmpp.org/extensions/inbox/charcount.html

The Council will decide in the next two weeks whether to accept this
proposal as an official XEP.