Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-20 Thread Ivan Daschinsky
We copy the values unchanged, as is, in their byte representation. Could you
please specify what could go wrong?
I see only one possibility:
1. Start the cluster with the default encoding (this is only the Windows case :)).
Set some metastorage values with non-ASCII chars.
2. Stop it and restart it with a different encoding specified.

I suppose that this is a very rare case. And all that the user has to do is
erase the metastore.

Another variant is to make all users erase the metastore in order to use UTF-8.
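
The restart scenario above can be reproduced outside Ignite. The following is a minimal sketch, assuming the first run encoded strings with ISO-8859-1 and the restarted node decodes them with UTF-8. The class and method names are illustrative and are not part of Ignite.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class RestartWithOtherCharset {
    /** Encodes with the charset of the first run and decodes with the charset of the second run. */
    static String writeThenRead(String val, Charset writeCs, Charset readCs) {
        byte[] stored = val.getBytes(writeCs); // bytes persisted before the restart
        return new String(stored, readCs);     // string seen after the restart
    }

    public static void main(String[] args) {
        String ascii = "cafe";
        String nonAscii = "caf\u00e9"; // last char is e with acute accent, escaped to keep the source ASCII-only

        // The ASCII value survives the charset switch, the non-ASCII value does not.
        System.out.println(ascii.equals(
            writeThenRead(ascii, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
        System.out.println(nonAscii.equals(
            writeThenRead(nonAscii, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
    }
}
```

Only values with non-ASCII characters are affected, which matches the "very rare case" estimate above.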




-- 
Sincerely yours, Ivan Daschinskiy


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-20 Thread Andrey Mashenkov
Ivan,

I'm still not sure it is a good idea to upgrade the metastorage automatically,
because we can't detect the charset the metastorage was created with, and
at the same time we can't be sure the current charset is the correct one.

So, is there any guarantee that the metastorage is consistent even if it was
"upgraded" successfully?

As I see it, we just copy the metastorage keys to a temporary store key by key
and then write them back to the original one.
It seems that if something goes wrong, the user may end up with both stores
(original and temporary) broken.
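
The point about detection can be illustrated with a small sketch: the same bytes may decode without any error under more than one charset, so a "successful" decode does not prove that the right charset was used. The helper below is hypothetical and uses only the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class CharsetAmbiguity {
    /** Strictly decodes the bytes; returns null if they are not valid for the given charset. */
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        }
        catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] stored = {(byte)0xC3, (byte)0xA9};

        // Both calls succeed, but they produce different strings.
        System.out.println(strictDecode(stored, StandardCharsets.UTF_8));      // one character
        System.out.println(strictDecode(stored, StandardCharsets.ISO_8859_1)); // two characters
    }
}
```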



-- 
Best regards,
Andrey V. Mashenkov


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-20 Thread Ivan Daschinsky
Andrey, I believe that we already have all the machinery to do the migration safely.
See, for example,
org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage#init
and
org.apache.ignite.internal.processors.cache.persistence.metastorage.MetaStorage.TmpStorage.
This machinery was introduced for a slightly different task, but we can reuse
it for the current purpose.
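
For readers unfamiliar with the general idea of migrating through a temporary copy, here is a generic sketch. It is NOT how MetaStorage#init or MetaStorage.TmpStorage work (their internals are not described in this thread). It only shows the common pattern of writing the converted data to a temporary file and swapping it in with a single move, so that the original is left untouched if the conversion fails midway. All names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Generic temp-copy-then-swap pattern; not Ignite's TmpStorage. */
final class TempCopyMigration {
    /** Re-encodes a text file from oldCs to UTF-8 via a sibling temporary file. */
    static void migrate(Path store, Charset oldCs) throws IOException {
        // 1. Read and decode the original. Nothing has been modified yet.
        String text = new String(Files.readAllBytes(store), oldCs);

        // 2. Write the converted copy next to the original.
        Path tmp = store.resolveSibling(store.getFileName() + ".tmp");
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));

        // 3. Swap it in with one move. Whether an existing target is replaced
        //    under ATOMIC_MOVE is implementation specific per the Files.move docs.
        Files.move(tmp, store, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A crash before step 3 leaves the original file intact, and a leftover ".tmp" file signals an unfinished migration.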



-- 
Sincerely yours, Ivan Daschinskiy


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-20 Thread Mikhail Petrov

Thank you all for your replies!
I got the idea and agree with it. Based on the results of the
discussion, I have filed a ticket [1].

I will try to investigate it.

[1] - https://issues.apache.org/jira/browse/IGNITE-16157



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-16 Thread Ivan Daschinsky
Andrey, agree with you, good point.



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-16 Thread Andrey Mashenkov
Guys,

I like the idea with a flag, but for a different purpose.
I think it is easy to detect (using the flag) whether the
metastorage was created on a new version with a fixed charset or on an
older version with the user-defined default.
Depending on the flag, we can choose the new strategy that forces UTF-8, or fall back
to the old one with defaultCharset and print a warning and a recommendation
in the log.

Adding any compatibility code is very error-prone, because if you
fail in the middle of the restore process, you will get a broken metastorage
with keys in different charsets.
At that point, there is no way to detect the broken keys anymore.
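
The flag-based choice described above could look roughly like the sketch below. The flag, the method, and the warning text are assumptions made for illustration. The thread does not define where the flag is stored or how it is read.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not Ignite code. */
final class MetastoreCharsetChoice {
    /**
     * Picks the charset for metastorage strings.
     *
     * @param createdWithFixedCharset Hypothetical flag: true if the store was created by a version that forces UTF-8.
     * @param platformDflt Platform default charset of the current JVM.
     * @param warnings Collects messages that would go to the log.
     */
    static Charset choose(boolean createdWithFixedCharset, Charset platformDflt, List<String> warnings) {
        if (createdWithFixedCharset)
            return StandardCharsets.UTF_8;

        // Old store: keep the old behavior, no in-place conversion, only a warning.
        warnings.add("Metastorage was created with the platform default charset (" + platformDflt +
            "). Consider recreating it to switch to UTF-8.");

        return platformDflt;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();

        System.out.println(choose(false, Charset.defaultCharset(), warnings));
        System.out.println(warnings);
    }
}
```

This variant never rewrites existing keys, so it cannot leave the store half-converted.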


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-16 Thread Ivan Daschinsky
Slava, great ticket!

I suppose that we can add a feature flag to BPlusMetaIO, and if it is not
present or its value is false, we can rebuild the metastore during
recovery: decode the strings using the default system encoding and save all of them
back in UTF-8. After recovery, we should use UTF-8 by default.
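
The per-value conversion step of such a rebuild can be sketched as follows. This assumes the legacy charset is known, which is exactly what is questioned elsewhere in this thread. The flag parameter and the helper name are hypothetical and do not reflect the actual BPlusMetaIO layout.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class MetastoreReencode {
    /**
     * Converts stored string bytes to UTF-8 unless the store is already marked as UTF-8.
     *
     * @param stored Bytes as found in the store.
     * @param legacyCs Charset assumed to have been used when the bytes were written.
     * @param utf8Flag Hypothetical feature flag: true if the store already uses UTF-8.
     */
    static byte[] toUtf8(byte[] stored, Charset legacyCs, boolean utf8Flag) {
        if (utf8Flag)
            return stored; // Already converted, nothing to do.

        // Decode with the legacy charset, then encode explicitly as UTF-8.
        return new String(stored, legacyCs).getBytes(StandardCharsets.UTF_8);
    }
}
```

The result is correct only if legacyCs really is the charset the bytes were written with. A wrong guess converts without any error and silently stores different text.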



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-16 Thread Вячеслав Коптилин
Hi folks,

IMHO, we should do our best to fix all these places and should avoid using
the default charset. In my understanding, this is only

> The main question is - should we restrict the join of nodes with
> different encodings or just fix all places where implicit default encoding
> is used and specify the explicit one as Ivan Daschinsky suggested?
Restricting the join of nodes is not a solution for all cases. You are in
trouble even if you use a one-node cluster: just change the default
charset on your system and restart the node with an existing PDS [1].

> As for me, I'm expecting a way more problem with enforcing rule to fail,
> rather than enforcing all components to use UTF-8
Absolutely agree with Ivan.
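
"Fixing all these places" boils down to replacing calls that rely on the platform default with calls that name the charset. A minimal sketch with hypothetical helper names:

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class ExplicitCharset {
    /** Explicit variant of s.getBytes(), which would use the platform default charset. */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Explicit variant of new String(bytes), which would use the platform default charset. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String val = "caf\u00e9"; // escaped to keep the source ASCII-only

        // The result no longer depends on file.encoding or the OS locale.
        System.out.println(val.equals(decode(encode(val))));
    }
}
```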

[1] https://issues.apache.org/jira/browse/IGNITE-16080

Thanks,
S.


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Pavlukhin
Do the encodings in question somehow influence the actual stored data
(bytes)? If so, using an implicit platform encoding sounds quite
dangerous. Moving data between servers (or perhaps even rebalancing)
can lead to bad consequences. Anyway, IMHO an implicit encoding is
not good, but a sensible default is quite robust.


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Daschinsky
Unpaired surrogates are emoji symbols. One should be completely insane to
use emojis in a login.
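
A note on terminology, with a small sketch: in Java's UTF-16 strings, an emoji outside the Basic Multilingual Plane is stored as a valid surrogate pair and round-trips through UTF-8, while an unpaired surrogate is a lone half of such a pair. The helper below is hypothetical and only uses standard java.lang.Character methods.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class SurrogateKinds {
    /** Returns true if the string contains a high or low surrogate that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);

            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1)))
                    i++; // Valid pair: skip the low half.
                else
                    return true;
            }
            else if (Character.isLowSurrogate(c))
                return true;
        }

        return false;
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // one code point (U+1F600) stored as two chars

        System.out.println(hasUnpairedSurrogate(emoji));    // valid pair
        System.out.println(hasUnpairedSurrogate("\uD83D")); // lone high half
        System.out.println(emoji.equals(
            new String(emoji.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)));
    }
}
```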


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Mikhail Petrov
Ivan, strings with unpaired surrogate symbols are serialized and
deserialized by the Java UTF-8 codec without errors, but the result does not
match the initial string. As a result, if a user's login
contains these symbols, it will be distorted after deserialization and
the user will not be able to log in. I understand that it is quite a
rare case.
Anyway, a way to solve this problem was introduced here -
https://issues.apache.org/jira/browse/IGNITE-3098
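
The mismatch can be reproduced with plain JDK calls. This is a minimal sketch using String.getBytes(Charset), which is documented to replace malformed input with the charset's default replacement bytes instead of failing. It does not use the Ignite code from the ticket above.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: not Ignite code. */
final class LoneSurrogateRoundTrip {
    /** Encodes to UTF-8 and decodes back, the way a naive serializer would. */
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String login = "a\uD800b"; // contains a lone high surrogate

        String restored = roundTrip(login);

        // No exception is thrown, but the lone surrogate has been replaced.
        System.out.println(login.equals(restored));
        System.out.println(restored);
    }
}
```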


Frankly, it is not the topic I would like to discuss now. The main
question is: should we restrict the join of nodes with different
encodings, or just fix all places where the implicit default encoding is used
and specify an explicit one, as Ivan Daschinsky suggested?


From my point of view, it is better to reject nodes with different
encodings (especially after Ilya Kasnacheev mentioned that we already
have a warning: "Differing character encodings across cluster may lead
to erratic behavior"). It will help to avoid the "erratic behavior", not
just warn about it. This is important, since problems related to string
encoding can occur in different components and their cause is not
always obvious.
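
The "reject instead of warn" variant could be sketched as below. This is a hypothetical check, not Ignite's node validation code. How the charset name would be exchanged between nodes is an assumption and is not shown.

```java
import java.nio.charset.Charset;

/** Illustrative only: not Ignite code. */
final class JoinCharsetCheck {
    /**
     * Fails if the joining node's default charset differs from the cluster's.
     *
     * @param clusterCharset Charset name reported by the cluster (hypothetical node attribute).
     * @param joiningCharset Charset name reported by the joining node.
     */
    static void validateJoin(String clusterCharset, String joiningCharset) {
        // Charset.forName resolves aliases, so "UTF8" and "UTF-8" compare as equal.
        Charset cluster = Charset.forName(clusterCharset);
        Charset joining = Charset.forName(joiningCharset);

        if (!cluster.equals(joining))
            throw new IllegalStateException("Joining node uses charset " + joining +
                ", but the cluster uses " + cluster);
    }

    public static void main(String[] args) {
        validateJoin("UTF-8", "UTF8"); // same charset under an alias: accepted

        try {
            validateJoin("UTF-8", "ISO-8859-1");
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```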


WDYT?

On 13.12.2021 20:01, Ivan Pavlukhin wrote:

>> I guess Nikolay is talking about the problem with UTF-8 in case string
>> contains unpaired surrogate symbols
> Folks, give me a clue why it is a problem? Naively it seems to be a
> good restriction rather than problem. What problems can it cause in
> practice?
>
> 2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev:
>> Hello!
>>
>> We already have a warning about this, see IgniteKernal.checkFileEncoding()
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky:
>>
>>>> But now multiple components
>>>> independently serialize strings for their needs and use default
>>>> encoding for this.
>>>> For example  DirectByteBufferStreamImplV2#writeString,
>>>> MetaStorage#writeRaw and so on
>>> We should fix all of them.
>>>
>>>> BinaryUtils#utf8BytesToStr
>>> Let's use this everywhere.
>>>
>>> As for me, I'm expecting a way more problem with enforcing rule to fail,
>>> rather than enforcing all components to use UTF-8
>>> Some weird cases  (surrogate pairs) we can (I strongly believe it is OK)
>>> simply do not consider at all.
>>>

пн, 13 дек. 2021 г. в 15:15, Nikolay Izhikov :


Does Java String support all unicode characters and particularly does

it

support more characters than UTF-8

It’s not about Java, it’s about UTF-8 standard.

Please, take a look at [1]


In November 2003, UTF-8 was restricted by RFC 3629 to match the

constraints of the UTF-16 character encoding: explicitly prohibiting
code
points corresponding to the high and low surrogate characters removed

more

than 3% of the three-byte sequences, and ending at U+10 removed
more
than 48% of the four-byte sequences and all five- and six-byte
sequences.

And [2]


The definition of UTF-8 prohibits encoding character numbers between

U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding

form

(as surrogate pairs) and do not directly represent characters.

Actually, we already has some modes to support this restriction of
UTF-8.
Please, take a look at BinaryUtils#utf8BytesToStr [3]


[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3]


https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

13 дек. 2021 г., в 13:57, Ivan Pavlukhin 

написал(а):

UTF-8 can’t encode all UNICODE characters.

Nikolay, could you please elaborate? My understanding is that
encoding
we speak about matters for conversion from byte arrays to strings.
Does Java String support all unicode characters and particularly does
it support more characters than UTF-8 (I am not saying here that java
String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :

UTF-8 is already a default encoding in our BinaryObject format.
So

I am

for unification.

пн, 13 дек. 2021 г. в 12:50, Nikolay Izhikov :


Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.


13 дек. 2021 г., в 12:49, Ivan Daschinsky 

написал(а):

Khm, maybe a better variant is  to enforce all strings to be
encoded

in

UTF-8?
AFAIK multi OS cluster is a quite common case.


пн, 13 дек. 2021 г. в 11:36, Mikhail Petrov 
:

Igniters,

Recently we faced the problem that if the cluster consists of
nodes
running in the JVM with different encodings, many issues arise.
The root cause of the mentioned issues is components that use
`String#getBytes()` and `new String()`, which relies
on
the
system default encoding. Thus, if a string is deserialized on a

node

with a different encoding from the one that serialized it, the
deserialized string can be different from the original one.

For example:

Serialization/deserialization of string in communication messages

may

be
broken for some strings on nodes running in a JVM with a
different
encoding as DirectByteBufferStreamImplV2 uses String#getBytes()
to
serialize strings - [1]

Or the 

Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Pavlukhin
> I guess Nikolay is talking about the problem with UTF-8 in case string 
> contains unpaired surrogate symbols

Folks, could you give me a clue why this is a problem? Naively, it seems
to be a good restriction rather than a problem. What problems can it cause
in practice?

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev :
> Hello!
>
> We already have a warning about this, see IgniteKernal.checkFileEncoding()
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, Dec 13, 2021 at 16:26, Ivan Daschinsky wrote:
>
>> >> But now multiple components
>> >> independently serialize strings for their needs and use default
>> >> encoding
>> >> for this.
>> >> For example  DirectByteBufferStreamImplV2#writeString,
>> >> MetaStorage#writeRaw and so on
>> We should fix all of them.
>>
>> >> BinaryUtils#utf8BytesToStr
>> Lets use this everywhere.
>>
>> As for me, I'm expecting a way more problem with enforcing rule to fail,
>> rather than enforcing all components to use UTF-8
>> Some weird cases  (surrogate pairs) we can (I strongly believe it is OK)
>> simply do not consider at all.
>>
>> Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov wrote:
>>
>> > > Does Java String support all unicode characters and particularly does
>> it
>> > support more characters than UTF-8
>> >
>> > It’s not about Java, it’s about UTF-8 standard.
>> >
>> > Please, take a look at [1]
>> >
>> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the
>> > constraints of the UTF-16 character encoding: explicitly prohibiting
>> > code
>> > points corresponding to the high and low surrogate characters removed
>> more
>> > than 3% of the three-byte sequences, and ending at U+10FFFF removed
>> > more
>> > than 48% of the four-byte sequences and all five- and six-byte
>> > sequences.
>> >
>> > And [2]
>> >
>> > > The definition of UTF-8 prohibits encoding character numbers between
>> > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
>> form
>> > (as surrogate pairs) and do not directly represent characters.
>> >
>> > Actually, we already has some modes to support this restriction of
>> > UTF-8.
>> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
>> >
>> >
>> > [1] https://en.wikipedia.org/wiki/UTF-8
>> > [2] https://datatracker.ietf.org/doc/html/rfc3629
>> > [3]
>> >
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>> >
>> > > On Dec 13, 2021, at 13:57, Ivan Pavlukhin
>> > wrote:
>> > >
>> > >> UTF-8 can’t encode all UNICODE characters.
>> > >
>> > > Nikolay, could you please elaborate? My understanding is that
>> > > encoding
>> > > we speak about matters for conversion from byte arrays to strings.
>> > > Does Java String support all unicode characters and particularly does
>> > > it support more characters than UTF-8 (I am not saying here that java
>> > > String uses UTF-8)?
>> > >
>> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :
>> > >> UTF-8 is already a default encoding in our BinaryObject format.
>> > >> So
>> > I am
>> > >> for unification.
>> > >>
>> > >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
>> > >>
>> > >>> Hello, Ivan.
>> > >>>
>> > >>> UTF-8 can’t encode all UNICODE characters.
>> > >>>
>> >  On Dec 13, 2021, at 12:49, Ivan Daschinsky
>> > >>> wrote:
>> > 
>> >  Khm, maybe a better variant is  to enforce all strings to be
>> >  encoded
>> > in
>> >  UTF-8?
>> >  AFAIK multi OS cluster is a quite common case.
>> > 
>> > 
>> >  Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
>> > 
>> > > Igniters,
>> > >
>> > > Recently we faced the problem that if the cluster consists of
>> > > nodes
>> > > running in the JVM with different encodings, many issues arise.
>> > > The root cause of the mentioned issues is components that use
>> > > `String#getBytes()` and `new String()`, which relies
>> > > on
>> > > the
>> > > system default encoding. Thus, if a string is deserialized on a
>> node
>> > > with a different encoding from the one that serialized it, the
>> > > deserialized string can be different from the original one.
>> > >
>> > > For example:
>> > >
>> > > Serialization/deserialization of string in communication messages
>> may
>> > > be
>> > > broken for some strings on nodes running in a JVM with a
>> > > different
>> > > encoding as DirectByteBufferStreamImplV2 uses String#getBytes()
>> > > to
>> > > serialize strings - [1]
>> > >
>> > > Or the IgniteAuthenticationProcessor can compute different
>> > > security
>> > > IDs
>> > > for the user on different nodes in this case - [2]
>> > >
>> > > What do you think, if we solve this problem globally, by
>> > > rejecting
>> to
>> > > join nodes that run on JVMs with different encodings?
>> > >
>> > > As a result, we will be sure that all cluster nodes have the same
>> > > encoding and all related problems will be 

Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ilya Kasnacheev
Hello!

We already have a warning about this, see IgniteKernal.checkFileEncoding()

Regards,
-- 
Ilya Kasnacheev


Mon, Dec 13, 2021 at 16:26, Ivan Daschinsky wrote:

> >> But now multiple components
> >> independently serialize strings for their needs and use default encoding
> >> for this.
> >> For example  DirectByteBufferStreamImplV2#writeString,
> >> MetaStorage#writeRaw and so on
> We should fix all of them.
>
> >> BinaryUtils#utf8BytesToStr
> Lets use this everywhere.
>
> As for me, I'm expecting a way more problem with enforcing rule to fail,
> rather than enforcing all components to use UTF-8
> Some weird cases  (surrogate pairs) we can (I strongly believe it is OK)
> simply do not consider at all.
>
> Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov wrote:
>
> > > Does Java String support all unicode characters and particularly does
> it
> > support more characters than UTF-8
> >
> > It’s not about Java, it’s about UTF-8 standard.
> >
> > Please, take a look at [1]
> >
> > > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> > constraints of the UTF-16 character encoding: explicitly prohibiting code
> > points corresponding to the high and low surrogate characters removed
> more
> > than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> > than 48% of the four-byte sequences and all five- and six-byte sequences.
> >
> > And [2]
> >
> > > The definition of UTF-8 prohibits encoding character numbers between
> > U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
> form
> > (as surrogate pairs) and do not directly represent characters.
> >
> > Actually, we already has some modes to support this restriction of UTF-8.
> > Please, take a look at BinaryUtils#utf8BytesToStr [3]
> >
> >
> > [1] https://en.wikipedia.org/wiki/UTF-8
> > [2] https://datatracker.ietf.org/doc/html/rfc3629
> > [3]
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
> >
> > > On Dec 13, 2021, at 13:57, Ivan Pavlukhin
> > wrote:
> > >
> > >> UTF-8 can’t encode all UNICODE characters.
> > >
> > > Nikolay, could you please elaborate? My understanding is that encoding
> > > we speak about matters for conversion from byte arrays to strings.
> > > Does Java String support all unicode characters and particularly does
> > > it support more characters than UTF-8 (I am not saying here that java
> > > String uses UTF-8)?
> > >
> > > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :
> > >> UTF-8 is already a default encoding in our BinaryObject format. So
> > I am
> > >> for unification.
> > >>
> > >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
> > >>
> > >>> Hello, Ivan.
> > >>>
> > >>> UTF-8 can’t encode all UNICODE characters.
> > >>>
> >  On Dec 13, 2021, at 12:49, Ivan Daschinsky
> > >>> wrote:
> > 
> >  Khm, maybe a better variant is  to enforce all strings to be encoded
> > in
> >  UTF-8?
> >  AFAIK multi OS cluster is a quite common case.
> > 
> > 
> >  Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
> > 
> > > Igniters,
> > >
> > > Recently we faced the problem that if the cluster consists of nodes
> > > running in the JVM with different encodings, many issues arise.
> > > The root cause of the mentioned issues is components that use
> > > `String#getBytes()` and `new String()`, which relies on
> > > the
> > > system default encoding. Thus, if a string is deserialized on a
> node
> > > with a different encoding from the one that serialized it, the
> > > deserialized string can be different from the original one.
> > >
> > > For example:
> > >
> > > Serialization/deserialization of string in communication messages
> may
> > > be
> > > broken for some strings on nodes running in a JVM with a different
> > > encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
> > > serialize strings - [1]
> > >
> > > Or the IgniteAuthenticationProcessor can compute different security
> > > IDs
> > > for the user on different nodes in this case - [2]
> > >
> > > What do you think, if we solve this problem globally, by rejecting
> to
> > > join nodes that run on JVMs with different encodings?
> > >
> > > As a result, we will be sure that all cluster nodes have the same
> > > encoding and all related problems will be solved.
> > >
> > > [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> > > [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> > >
> > > --
> > > Mikhail
> > >
> > >
> > 
> >  --
> >  Sincerely yours, Ivan Daschinskiy
> > >>>
> > >>>
> > >>
> > >> --
> > >> Sincerely yours, Ivan Daschinskiy
> > >>
> > >
> > >
> > > --
> > >
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
>
> --
> Sincerely yours, Ivan Daschinskiy
>


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Daschinsky
>> But now multiple components
>> independently serialize strings for their needs and use default encoding
>> for this.
>> For example  DirectByteBufferStreamImplV2#writeString,
>> MetaStorage#writeRaw and so on
We should fix all of them.

>> BinaryUtils#utf8BytesToStr
Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that fails the
join than from making all components use UTF-8.
Some weird cases (surrogate pairs) we can simply not consider at all; I
strongly believe that is OK.

Mon, Dec 13, 2021 at 15:15, Nikolay Izhikov wrote:

> > Does Java String support all unicode characters and particularly does it
> support more characters than UTF-8
>
> It’s not about Java, it’s about UTF-8 standard.
>
> Please, take a look at [1]
>
> > In November 2003, UTF-8 was restricted by RFC 3629 to match the
> constraints of the UTF-16 character encoding: explicitly prohibiting code
> points corresponding to the high and low surrogate characters removed more
> than 3% of the three-byte sequences, and ending at U+10FFFF removed more
> than 48% of the four-byte sequences and all five- and six-byte sequences.
>
> And [2]
>
> > The definition of UTF-8 prohibits encoding character numbers between
> U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form
> (as surrogate pairs) and do not directly represent characters.
>
> Actually, we already has some modes to support this restriction of UTF-8.
> Please, take a look at BinaryUtils#utf8BytesToStr [3]
>
>
> [1] https://en.wikipedia.org/wiki/UTF-8
> [2] https://datatracker.ietf.org/doc/html/rfc3629
> [3]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387
>
> > On Dec 13, 2021, at 13:57, Ivan Pavlukhin
> wrote:
> >
> >> UTF-8 can’t encode all UNICODE characters.
> >
> > Nikolay, could you please elaborate? My understanding is that encoding
> > we speak about matters for conversion from byte arrays to strings.
> > Does Java String support all unicode characters and particularly does
> > it support more characters than UTF-8 (I am not saying here that java
> > String uses UTF-8)?
> >
> > 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :
> >> UTF-8 is already a default encoding in our BinaryObject format. So
> I am
> >> for unification.
> >>
> >> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
> >>
> >>> Hello, Ivan.
> >>>
> >>> UTF-8 can’t encode all UNICODE characters.
> >>>
>  On Dec 13, 2021, at 12:49, Ivan Daschinsky
> >>> wrote:
> 
>  Khm, maybe a better variant is  to enforce all strings to be encoded
> in
>  UTF-8?
>  AFAIK multi OS cluster is a quite common case.
> 
> 
>  Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
> 
> > Igniters,
> >
> > Recently we faced the problem that if the cluster consists of nodes
> > running in the JVM with different encodings, many issues arise.
> > The root cause of the mentioned issues is components that use
> > `String#getBytes()` and `new String()`, which relies on
> > the
> > system default encoding. Thus, if a string is deserialized on a node
> > with a different encoding from the one that serialized it, the
> > deserialized string can be different from the original one.
> >
> > For example:
> >
> > Serialization/deserialization of string in communication messages may
> > be
> > broken for some strings on nodes running in a JVM with a different
> > encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
> > serialize strings - [1]
> >
> > Or the IgniteAuthenticationProcessor can compute different security
> > IDs
> > for the user on different nodes in this case - [2]
> >
> > What do you think, if we solve this problem globally, by rejecting to
> > join nodes that run on JVMs with different encodings?
> >
> > As a result, we will be sure that all cluster nodes have the same
> > encoding and all related problems will be solved.
> >
> > [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> > [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >
> > --
> > Mikhail
> >
> >
> 
>  --
>  Sincerely yours, Ivan Daschinskiy
> >>>
> >>>
> >>
> >> --
> >> Sincerely yours, Ivan Daschinskiy
> >>
> >
> >
> > --
> >
> > Best regards,
> > Ivan Pavlukhin
>
>

-- 
Sincerely yours, Ivan Daschinskiy


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Nikolay Izhikov
> Does Java String support all unicode characters and particularly does it 
> support more characters than UTF-8

It’s not about Java, it’s about the UTF-8 standard.

Please, take a look at [1] 

> In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints
> of the UTF-16 character encoding: explicitly prohibiting code points
> corresponding to the high and low surrogate characters removed more than 3%
> of the three-byte sequences, and ending at U+10FFFF removed more than 48% of
> the four-byte sequences and all five- and six-byte sequences.

And [2] 

> The definition of UTF-8 prohibits encoding character numbers between U+D800 
> and U+DFFF, which are reserved for use with the UTF-16 encoding form (as 
> surrogate pairs) and do not directly represent characters.

Actually, we already have some modes to support this restriction of UTF-8.
Please, take a look at BinaryUtils#utf8BytesToStr [3]


[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] 
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

> On Dec 13, 2021, at 13:57, Ivan Pavlukhin wrote:
> 
>> UTF-8 can’t encode all UNICODE characters.
> 
> Nikolay, could you please elaborate? My understanding is that encoding
> we speak about matters for conversion from byte arrays to strings.
> Does Java String support all unicode characters and particularly does
> it support more characters than UTF-8 (I am not saying here that java
> String uses UTF-8)?
> 
> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :
>> UTF-8 is already a default encoding in our BinaryObject format. So I am
>> for unification.
>> 
>> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
>> 
>>> Hello, Ivan.
>>> 
>>> UTF-8 can’t encode all UNICODE characters.
>>> 
 On Dec 13, 2021, at 12:49, Ivan Daschinsky
>>> wrote:
 
 Khm, maybe a better variant is  to enforce all strings to be encoded in
 UTF-8?
 AFAIK multi OS cluster is a quite common case.
 
 
 Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
 
> Igniters,
> 
> Recently we faced the problem that if the cluster consists of nodes
> running in the JVM with different encodings, many issues arise.
> The root cause of the mentioned issues is components that use
> `String#getBytes()` and `new String()`, which relies on
> the
> system default encoding. Thus, if a string is deserialized on a node
> with a different encoding from the one that serialized it, the
> deserialized string can be different from the original one.
> 
> For example:
> 
> Serialization/deserialization of string in communication messages may
> be
> broken for some strings on nodes running in a JVM with a different
> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
> serialize strings - [1]
> 
> Or the IgniteAuthenticationProcessor can compute different security
> IDs
> for the user on different nodes in this case - [2]
> 
> What do you think, if we solve this problem globally, by rejecting to
> join nodes that run on JVMs with different encodings?
> 
> As a result, we will be sure that all cluster nodes have the same
> encoding and all related problems will be solved.
> 
> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> 
> --
> Mikhail
> 
> 
 
 --
 Sincerely yours, Ivan Daschinskiy
>>> 
>>> 
>> 
>> --
>> Sincerely yours, Ivan Daschinskiy
>> 
> 
> 
> -- 
> 
> Best regards,
> Ivan Pavlukhin



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Mikhail Petrov

Ivan Daschinsky,


> better variant is to enforce all strings to be encoded in
> UTF-8


I agree that it is a possible way to go. But currently multiple components
independently serialize strings for their own needs and use the default
encoding for this.
For example, DirectByteBufferStreamImplV2#writeString,
MetaStorage#writeRaw and so on. Even if we fix all these cases, we cannot
guarantee that the problem described above will not arise again.


Also, it seems easy for the user to specify the encoding for the
Ignite Java process manually, through the `file.encoding` system property.
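
A small sketch of the underlying issue, using only the JDK (the class and
method names are hypothetical, not Ignite code). It imitates two nodes
whose default encodings differ by passing the writer's and the reader's
charsets explicitly; ISO-8859-1 and UTF-8 are chosen only because every
JVM is required to support them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo, not part of Ignite. */
final class DefaultEncodingMismatch {
    /** Serializes on a "node" using {@code writer} and deserializes on a "node" using {@code reader}. */
    static String roundTrip(String s, Charset writer, Charset reader) {
        return new String(s.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        // "café", written with a Unicode escape so the source file encoding does not matter.
        String login = "caf\u00E9";

        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Different encodings on both sides: the login is distorted.
        System.out.println(login.equals(roundTrip(login, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
        System.out.println(login.equals(roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));

        // Explicit UTF-8 on both sides: the login survives.
        System.out.println(login.equals(roundTrip(login, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
    }
}
```

`String#getBytes()` and `new String(byte[])` without a charset argument
behave like the mismatching calls whenever the nodes were started with
different `file.encoding` values.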


Ivan Pavlukhin,

I guess Nikolay is talking about the problem with UTF-8 when a string
contains unpaired surrogate symbols (surrogates are used for encoding in
UTF-16). In this case UTF-8 fails to serialize the string correctly, since
unpaired surrogate characters are forbidden in UTF-8. However, this
problem was solved for the binary marshaller; see
`BinaryWriterExImpl#doWriteString` and `BinaryUtils#strToUtf8Bytes`.
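
The effect can be reproduced with plain JDK calls. The sketch below is an
illustration only (hypothetical class name, it does not use the binary
marshaller): `String#getBytes(Charset)` silently substitutes the charset's
replacement bytes for an unpaired surrogate, so the round trip does not
restore the original string.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo, not part of Ignite. */
final class UnpairedSurrogateRoundTrip {
    /** Encodes the string to UTF-8 and decodes it back. */
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String paired = "a\uD83D\uDE00b"; // valid surrogate pair (U+1F600)
        String unpaired = "a\uD800b";     // high surrogate without a low one

        System.out.println(paired.equals(utf8RoundTrip(paired)));
        System.out.println(unpaired.equals(utf8RoundTrip(unpaired)));
    }
}
```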


On 13.12.2021 13:57, Ivan Pavlukhin wrote:

>> UTF-8 can’t encode all UNICODE characters.
>
> Nikolay, could you please elaborate? My understanding is that encoding
> we speak about matters for conversion from byte arrays to strings.
> Does Java String support all unicode characters and particularly does
> it support more characters than UTF-8 (I am not saying here that java
> String uses UTF-8)?



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Pavlukhin
> UTF-8 can’t encode all UNICODE characters.

Nikolay, could you please elaborate? My understanding is that the encoding
we are talking about matters for conversion from byte arrays to strings.
Does Java String support all Unicode characters, and in particular does
it support more characters than UTF-8 (I am not saying here that Java
String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky :
> UTF-8 is already a default encoding in our BinaryObject format. So I am
> for unification.
>
> Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:
>
>> Hello, Ivan.
>>
>> UTF-8 can’t encode all UNICODE characters.
>>
>> > On Dec 13, 2021, at 12:49, Ivan Daschinsky
>> wrote:
>> >
>> > Khm, maybe a better variant is  to enforce all strings to be encoded in
>> > UTF-8?
>> > AFAIK multi OS cluster is a quite common case.
>> >
>> >
>> > Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
>> >
>> >> Igniters,
>> >>
>> >> Recently we faced the problem that if the cluster consists of nodes
>> >> running in the JVM with different encodings, many issues arise.
>> >> The root cause of the mentioned issues is components that use
>> >> `String#getBytes()` and `new String()`, which relies on
>> >> the
>> >> system default encoding. Thus, if a string is deserialized on a node
>> >> with a different encoding from the one that serialized it, the
>> >> deserialized string can be different from the original one.
>> >>
>> >> For example:
>> >>
>> >> Serialization/deserialization of string in communication messages may
>> >> be
>> >> broken for some strings on nodes running in a JVM with a different
>> >> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
>> >> serialize strings - [1]
>> >>
>> >> Or the IgniteAuthenticationProcessor can compute different security
>> >> IDs
>> >> for the user on different nodes in this case - [2]
>> >>
>> >> What do you think, if we solve this problem globally, by rejecting to
>> >> join nodes that run on JVMs with different encodings?
>> >>
>> >> As a result, we will be sure that all cluster nodes have the same
>> >> encoding and all related problems will be solved.
>> >>
>> >> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>> >> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>> >>
>> >> --
>> >> Mikhail
>> >>
>> >>
>> >
>> > --
>> > Sincerely yours, Ivan Daschinskiy
>>
>>
>
> --
> Sincerely yours, Ivan Daschinskiy
>


-- 

Best regards,
Ivan Pavlukhin


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Daschinsky
UTF-8 is already the default encoding in our BinaryObject format. So I am
for unification.

Mon, Dec 13, 2021 at 12:50, Nikolay Izhikov wrote:

> Hello, Ivan.
>
> UTF-8 can’t encode all UNICODE characters.
>
> > On Dec 13, 2021, at 12:49, Ivan Daschinsky
> wrote:
> >
> > Khm, maybe a better variant is  to enforce all strings to be encoded in
> > UTF-8?
> > AFAIK multi OS cluster is a quite common case.
> >
> >
> > Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
> >
> >> Igniters,
> >>
> >> Recently we faced the problem that if the cluster consists of nodes
> >> running in the JVM with different encodings, many issues arise.
> >> The root cause of the mentioned issues is components that use
> >> `String#getBytes()` and `new String()`, which relies on the
> >> system default encoding. Thus, if a string is deserialized on a node
> >> with a different encoding from the one that serialized it, the
> >> deserialized string can be different from the original one.
> >>
> >> For example:
> >>
> >> Serialization/deserialization of string in communication messages may be
> >> broken for some strings on nodes running in a JVM with a different
> >> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
> >> serialize strings - [1]
> >>
> >> Or the IgniteAuthenticationProcessor can compute different security IDs
> >> for the user on different nodes in this case - [2]
> >>
> >> What do you think, if we solve this problem globally, by rejecting to
> >> join nodes that run on JVMs with different encodings?
> >>
> >> As a result, we will be sure that all cluster nodes have the same
> >> encoding and all related problems will be solved.
> >>
> >> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> >> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
> >>
> >> --
> >> Mikhail
> >>
> >>
> >
> > --
> > Sincerely yours, Ivan Daschinskiy
>
>

-- 
Sincerely yours, Ivan Daschinskiy


Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Nikolay Izhikov
Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.

> On Dec 13, 2021, at 12:49, Ivan Daschinsky wrote:
> 
> Khm, maybe a better variant is  to enforce all strings to be encoded in
> UTF-8?
> AFAIK multi OS cluster is a quite common case.
> 
> 
> Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:
> 
>> Igniters,
>> 
>> Recently we faced the problem that if the cluster consists of nodes
>> running in the JVM with different encodings, many issues arise.
>> The root cause of the mentioned issues is components that use
>> `String#getBytes()` and `new String()`, which relies on the
>> system default encoding. Thus, if a string is deserialized on a node
>> with a different encoding from the one that serialized it, the
>> deserialized string can be different from the original one.
>> 
>> For example:
>> 
>> Serialization/deserialization of string in communication messages may be
>> broken for some strings on nodes running in a JVM with a different
>> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
>> serialize strings - [1]
>> 
>> Or the IgniteAuthenticationProcessor can compute different security IDs
>> for the user on different nodes in this case - [2]
>> 
>> What do you think, if we solve this problem globally, by rejecting to
>> join nodes that run on JVMs with different encodings?
>> 
>> As a result, we will be sure that all cluster nodes have the same
>> encoding and all related problems will be solved.
>> 
>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>> 
>> --
>> Mikhail
>> 
>> 
> 
> -- 
> Sincerely yours, Ivan Daschinskiy



Re: [DISCUSSION] Reject join of nodes with different character encodings

2021-12-13 Thread Ivan Daschinsky
Hm, maybe a better option is to enforce all strings to be encoded in
UTF-8?
AFAIK, a multi-OS cluster is quite a common case.


Mon, Dec 13, 2021 at 11:36, Mikhail Petrov wrote:

> Igniters,
>
> Recently we faced the problem that if the cluster consists of nodes
> running in the JVM with different encodings, many issues arise.
> The root cause of the mentioned issues is components that use
> `String#getBytes()` and `new String()`, which relies on the
> system default encoding. Thus, if a string is deserialized on a node
> with a different encoding from the one that serialized it, the
> deserialized string can be different from the original one.
>
> For example:
>
> Serialization/deserialization of string in communication messages may be
> broken for some strings on nodes running in a JVM with a different
> encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
> serialize strings - [1]
>
> Or the IgniteAuthenticationProcessor can compute different security IDs
> for the user on different nodes in this case - [2]
>
> What do you think, if we solve this problem globally, by rejecting to
> join nodes that run on JVMs with different encodings?
>
> As a result, we will be sure that all cluster nodes have the same
> encoding and all related problems will be solved.
>
> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>
> --
> Mikhail
>
>

-- 
Sincerely yours, Ivan Daschinskiy