Re: Fixing equality of Rows

2018-10-29 Thread Rui Wang
I might misunderstand what portability is in Beam. If the portability is
designed as each SDK has its own representation of something and after that
it's converted to portable representation, then wrapping byte[] into an
object is fine.

-Rui

On Mon, Oct 29, 2018 at 11:26 AM Gleb Kanterov  wrote:

> Rui, I'm not completely sure I understand why it isn't possible to find
> suitable encoding for portability. As I understand, the only requirement is
> deterministic encoding consistent with equality, so existing representation
> of BYTES will work (VarInt followed by bytes). In my understanding, it's up
> to SDK to decide what is the best representation of byte[] in each
> language. In our case, for Java, we fix equality by wrapping byte[] into an
> object.
>
> On Mon, Oct 29, 2018 at 6:42 PM Lukasz Cwik  wrote:
>
>> I believe Kenn is spot on. The focus of the issue is too narrow as your
>> talking about the short term problem related to Map.
>> Schemas are very similar to coders and coders have been solving this
>> problem by delegating to the underlying component coder to figure out
>> whether two things are equal. You could make all the coders used within
>> Schema's deterministic.
>> For example, what prevents you from using a MapCoder that has been made
>> to be deterministic by sorting the encoded keys lexicographically?
>>
>> On Mon, Oct 29, 2018 at 10:27 AM Rui Wang  wrote:
>>
>>> Seems to me that Only Map's quality check cannot be solved
>>> by deepEquals because Keys cannot be looked up correctly in Map>> ?>.  If we cannot have a useful use case for Map,  we could
>>> reject it in Schema and still keep byte[].
>>>
>>> The option3 needs to find a wrapper of byte[] that is
>>> language-independent & encoding-independent for portability, which seems a
>>> hard searching (and not possible?) process.
>>>
>>> -Rui
>>>
>>> On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov  wrote:
>>>
 There is an indirect connection to RowCoder because `MapCoder` isn't
 deterministic, therefore, this doesn't hold:

 >  - also each type (hence Row type) should have portable encoding(s)
 that respect this equality so shuffling is consistent

 I think it's a requirement only for rows we want to shuffle by.

 > About these specific use cases, how useful is it to support
 Map and List?

 Not sure about Map, but in BigQuery it's possible to define
 `ARRAY` type. It can group by BYTES, but not by ARRAYS.


 On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:

> About these specific use cases, how useful is it to support
> Map and List? These seem pretty exotic (maybe they
> aren't) and I wonder whether it would make sense to just reject them until
> we have a solid design.
>
> And wouldn't the same problems arise even without RowCoder? Is the
> path in that case to implement a custom coder?
>
> Regards,
> Anton
>
>
> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles 
> wrote:
>
>> I'll summarize my input to the discussion. It is rather high level.
>> But IMO:
>>
>>  - even though schemas are part of Beam Java today, I think they
>> should become part of portability when ready
>>  - so each type in a schema needs a language-independent &
>> encoding-independent notion of domain of values and equality (so 
>> obviously
>> equal bytes are equal)
>>  - embedding in any language (hence Row in Java) must have a schema
>> type-driven equality that matches this spec
>>  - also each type (hence Row type) should have portable encoding(s)
>> that respect this equality so shuffling is consistent
>>  - Row in Java should be able to decode these encodings to different
>> underlying representations and change its strategy over time
>>
>> Kenn
>>
>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov 
>> wrote:
>>
>>> With adding BYTES type, we broke equality.
>>> `RowCoder#consistentWithEquals` is always true, but this property 
>>> doesn't
>>> hold for exotic types such as `Map`, `List`. The root
>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>> instead of structural.
>>>
>>> Before we jump into solution mode, let's state what we want to have:
>>> - *API* have stable API and be able to evolve efficient and
>>> use-case specific implementations without breaking it
>>> - *Correctness *we can't trade off correctness, a trivial
>>> implementation should work
>>> - *Performance *comparing equality is a fundamental operation, and
>>> we want to make it cheap
>>>
>>> 1. set `consistentWithEquals` if there is BYTES field
>>> Pros: almost no pros
>>> Cons: It would introduce a significant number of allocations when
>>> comparing rows, so we reject this option.
>>>
>>> 2. implement custom deep equals in `Row#equals`
>>> Pros: good 

Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
Rui, I'm not completely sure I understand why it isn't possible to find
suitable encoding for portability. As I understand, the only requirement is
deterministic encoding consistent with equality, so existing representation
of BYTES will work (VarInt followed by bytes). In my understanding, it's up
to SDK to decide what is the best representation of byte[] in each
language. In our case, for Java, we fix equality by wrapping byte[] into an
object.

On Mon, Oct 29, 2018 at 6:42 PM Lukasz Cwik  wrote:

> I believe Kenn is spot on. The focus of the issue is too narrow as your
> talking about the short term problem related to Map.
> Schemas are very similar to coders and coders have been solving this
> problem by delegating to the underlying component coder to figure out
> whether two things are equal. You could make all the coders used within
> Schema's deterministic.
> For example, what prevents you from using a MapCoder that has been made to
> be deterministic by sorting the encoded keys lexicographically?
>
> On Mon, Oct 29, 2018 at 10:27 AM Rui Wang  wrote:
>
>> Seems to me that Only Map's quality check cannot be solved by
>> deepEquals because Keys cannot be looked up correctly in Map.
>> If we cannot have a useful use case for Map,  we could reject
>> it in Schema and still keep byte[].
>>
>> The option3 needs to find a wrapper of byte[] that is
>> language-independent & encoding-independent for portability, which seems a
>> hard searching (and not possible?) process.
>>
>> -Rui
>>
>> On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov  wrote:
>>
>>> There is an indirect connection to RowCoder because `MapCoder` isn't
>>> deterministic, therefore, this doesn't hold:
>>>
>>> >  - also each type (hence Row type) should have portable encoding(s)
>>> that respect this equality so shuffling is consistent
>>>
>>> I think it's a requirement only for rows we want to shuffle by.
>>>
>>> > About these specific use cases, how useful is it to support
>>> Map and List?
>>>
>>> Not sure about Map, but in BigQuery it's possible to define
>>> `ARRAY` type. It can group by BYTES, but not by ARRAYS.
>>>
>>>
>>> On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:
>>>
 About these specific use cases, how useful is it to support Map>>> ?> and List? These seem pretty exotic (maybe they aren't) and I
 wonder whether it would make sense to just reject them until we have a
 solid design.

 And wouldn't the same problems arise even without RowCoder? Is the path
 in that case to implement a custom coder?

 Regards,
 Anton


 On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles 
 wrote:

> I'll summarize my input to the discussion. It is rather high level.
> But IMO:
>
>  - even though schemas are part of Beam Java today, I think they
> should become part of portability when ready
>  - so each type in a schema needs a language-independent &
> encoding-independent notion of domain of values and equality (so obviously
> equal bytes are equal)
>  - embedding in any language (hence Row in Java) must have a schema
> type-driven equality that matches this spec
>  - also each type (hence Row type) should have portable encoding(s)
> that respect this equality so shuffling is consistent
>  - Row in Java should be able to decode these encodings to different
> underlying representations and change its strategy over time
>
> Kenn
>
> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov 
> wrote:
>
>> With adding BYTES type, we broke equality.
>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>> hold for exotic types such as `Map`, `List`. The root
>> cause is `byte[]`, where `equals` is implemented as reference equality
>> instead of structural.
>>
>> Before we jump into solution mode, let's state what we want to have:
>> - *API* have stable API and be able to evolve efficient and use-case
>> specific implementations without breaking it
>> - *Correctness *we can't trade off correctness, a trivial
>> implementation should work
>> - *Performance *comparing equality is a fundamental operation, and
>> we want to make it cheap
>>
>> 1. set `consistentWithEquals` if there is BYTES field
>> Pros: almost no pros
>> Cons: It would introduce a significant number of allocations when
>> comparing rows, so we reject this option.
>>
>> 2. implement custom deep equals in `Row#equals`
>> Pros: good performance, doesn't change API, `Row#equals` is correct
>> Cons: doesn't work for `Map`, unless we roll own
>> implementation
>> Cons: it's possible to obtain `List` from `getValue()` that
>> has broken equality, contains, etc, unless we roll own implementation
>> Cons: non-trivial and requires ~200LOC to implement
>>
>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>> StructuralByteArray)

Re: Fixing equality of Rows

2018-10-29 Thread Lukasz Cwik
I believe Kenn is spot on. The focus of the issue is too narrow as your
talking about the short term problem related to Map.
Schemas are very similar to coders and coders have been solving this
problem by delegating to the underlying component coder to figure out
whether two things are equal. You could make all the coders used within
Schema's deterministic.
For example, what prevents you from using a MapCoder that has been made to
be deterministic by sorting the encoded keys lexicographically?

On Mon, Oct 29, 2018 at 10:27 AM Rui Wang  wrote:

> Seems to me that Only Map's quality check cannot be solved by
> deepEquals because Keys cannot be looked up correctly in Map.
> If we cannot have a useful use case for Map,  we could reject
> it in Schema and still keep byte[].
>
> The option3 needs to find a wrapper of byte[] that is language-independent
> & encoding-independent for portability, which seems a hard searching (and
> not possible?) process.
>
> -Rui
>
> On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov  wrote:
>
>> There is an indirect connection to RowCoder because `MapCoder` isn't
>> deterministic, therefore, this doesn't hold:
>>
>> >  - also each type (hence Row type) should have portable encoding(s)
>> that respect this equality so shuffling is consistent
>>
>> I think it's a requirement only for rows we want to shuffle by.
>>
>> > About these specific use cases, how useful is it to support Map> ?> and List?
>>
>> Not sure about Map, but in BigQuery it's possible to define
>> `ARRAY` type. It can group by BYTES, but not by ARRAYS.
>>
>>
>> On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:
>>
>>> About these specific use cases, how useful is it to support Map>> ?> and List? These seem pretty exotic (maybe they aren't) and I
>>> wonder whether it would make sense to just reject them until we have a
>>> solid design.
>>>
>>> And wouldn't the same problems arise even without RowCoder? Is the path
>>> in that case to implement a custom coder?
>>>
>>> Regards,
>>> Anton
>>>
>>>
>>> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:
>>>
 I'll summarize my input to the discussion. It is rather high level. But
 IMO:

  - even though schemas are part of Beam Java today, I think they should
 become part of portability when ready
  - so each type in a schema needs a language-independent &
 encoding-independent notion of domain of values and equality (so obviously
 equal bytes are equal)
  - embedding in any language (hence Row in Java) must have a schema
 type-driven equality that matches this spec
  - also each type (hence Row type) should have portable encoding(s)
 that respect this equality so shuffling is consistent
  - Row in Java should be able to decode these encodings to different
 underlying representations and change its strategy over time

 Kenn

 On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:

> With adding BYTES type, we broke equality.
> `RowCoder#consistentWithEquals` is always true, but this property doesn't
> hold for exotic types such as `Map`, `List`. The root
> cause is `byte[]`, where `equals` is implemented as reference equality
> instead of structural.
>
> Before we jump into solution mode, let's state what we want to have:
> - *API* have stable API and be able to evolve efficient and use-case
> specific implementations without breaking it
> - *Correctness *we can't trade off correctness, a trivial
> implementation should work
> - *Performance *comparing equality is a fundamental operation, and we
> want to make it cheap
>
> 1. set `consistentWithEquals` if there is BYTES field
> Pros: almost no pros
> Cons: It would introduce a significant number of allocations when
> comparing rows, so we reject this option.
>
> 2. implement custom deep equals in `Row#equals`
> Pros: good performance, doesn't change API, `Row#equals` is correct
> Cons: doesn't work for `Map`, unless we roll own
> implementation
> Cons: it's possible to obtain `List` from `getValue()` that
> has broken equality, contains, etc, unless we roll own implementation
> Cons: non-trivial and requires ~200LOC to implement
>
> 3. wrapping byte[] into Java object with fixed equality (e.g.,
> StructuralByteArray)
> Pros: good performance and flexible to change how Java wrapper is
> implemented
> Pros: simple, doesn't require any specialized collections, no
> surprises, `Map` and `List` work.
> Cons: will change the return type of `Row#getValue`
>
> I want to suggest going with option #3. However, it isn't completely
> clear what wrapper we want to use, either it could be StructuralByteArray,
> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
> additional integer fields. StructuralByteArray doesn't have anything not
> necessary. One option would be adding 

Re: Fixing equality of Rows

2018-10-29 Thread Rui Wang
Seems to me that Only Map's quality check cannot be solved by
deepEquals because Keys cannot be looked up correctly in Map.
If we cannot have a useful use case for Map,  we could reject
it in Schema and still keep byte[].

The option3 needs to find a wrapper of byte[] that is language-independent
& encoding-independent for portability, which seems a hard searching (and
not possible?) process.

-Rui

On Mon, Oct 29, 2018 at 10:15 AM Gleb Kanterov  wrote:

> There is an indirect connection to RowCoder because `MapCoder` isn't
> deterministic, therefore, this doesn't hold:
>
> >  - also each type (hence Row type) should have portable encoding(s) that
> respect this equality so shuffling is consistent
>
> I think it's a requirement only for rows we want to shuffle by.
>
> > About these specific use cases, how useful is it to support Map ?> and List?
>
> Not sure about Map, but in BigQuery it's possible to define `ARRAY`
> type. It can group by BYTES, but not by ARRAYS.
>
>
> On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:
>
>> About these specific use cases, how useful is it to support Map> ?> and List? These seem pretty exotic (maybe they aren't) and I
>> wonder whether it would make sense to just reject them until we have a
>> solid design.
>>
>> And wouldn't the same problems arise even without RowCoder? Is the path
>> in that case to implement a custom coder?
>>
>> Regards,
>> Anton
>>
>>
>> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:
>>
>>> I'll summarize my input to the discussion. It is rather high level. But
>>> IMO:
>>>
>>>  - even though schemas are part of Beam Java today, I think they should
>>> become part of portability when ready
>>>  - so each type in a schema needs a language-independent &
>>> encoding-independent notion of domain of values and equality (so obviously
>>> equal bytes are equal)
>>>  - embedding in any language (hence Row in Java) must have a schema
>>> type-driven equality that matches this spec
>>>  - also each type (hence Row type) should have portable encoding(s) that
>>> respect this equality so shuffling is consistent
>>>  - Row in Java should be able to decode these encodings to different
>>> underlying representations and change its strategy over time
>>>
>>> Kenn
>>>
>>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:
>>>
 With adding BYTES type, we broke equality.
 `RowCoder#consistentWithEquals` is always true, but this property doesn't
 hold for exotic types such as `Map`, `List`. The root
 cause is `byte[]`, where `equals` is implemented as reference equality
 instead of structural.

 Before we jump into solution mode, let's state what we want to have:
 - *API* have stable API and be able to evolve efficient and use-case
 specific implementations without breaking it
 - *Correctness *we can't trade off correctness, a trivial
 implementation should work
 - *Performance *comparing equality is a fundamental operation, and we
 want to make it cheap

 1. set `consistentWithEquals` if there is BYTES field
 Pros: almost no pros
 Cons: It would introduce a significant number of allocations when
 comparing rows, so we reject this option.

 2. implement custom deep equals in `Row#equals`
 Pros: good performance, doesn't change API, `Row#equals` is correct
 Cons: doesn't work for `Map`, unless we roll own
 implementation
 Cons: it's possible to obtain `List` from `getValue()` that has
 broken equality, contains, etc, unless we roll own implementation
 Cons: non-trivial and requires ~200LOC to implement

 3. wrapping byte[] into Java object with fixed equality (e.g.,
 StructuralByteArray)
 Pros: good performance and flexible to change how Java wrapper is
 implemented
 Pros: simple, doesn't require any specialized collections, no
 surprises, `Map` and `List` work.
 Cons: will change the return type of `Row#getValue`

 I want to suggest going with option #3. However, it isn't completely
 clear what wrapper we want to use, either it could be StructuralByteArray,
 or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
 additional integer fields. StructuralByteArray doesn't have anything not
 necessary. One option would be adding `Row#getByteBuffer` that would be
 `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
 override it for better performance, but `getValue(i)` must return
 StructuralByteArray.

 References:
 - [BEAM-5866] Fix `Row#equals`,
 https://github.com/apache/beam/pull/6845
 - [BEAM-5646] Fix quality and hashcode for bytes in Row,
 https://github.com/apache/beam/pull/6765

 Gleb

>>>
>
> --
> Cheers,
> Gleb
>


Re: Fixing equality of Rows

2018-10-29 Thread Gleb Kanterov
There is an indirect connection to RowCoder because `MapCoder` isn't
deterministic, therefore, this doesn't hold:

>  - also each type (hence Row type) should have portable encoding(s) that
respect this equality so shuffling is consistent

I think it's a requirement only for rows we want to shuffle by.

> About these specific use cases, how useful is it to support Map and List?

Not sure about Map, but in BigQuery it's possible to define `ARRAY`
type. It can group by BYTES, but not by ARRAYS.


On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin  wrote:

> About these specific use cases, how useful is it to support Map
> and List? These seem pretty exotic (maybe they aren't) and I wonder
> whether it would make sense to just reject them until we have a solid
> design.
>
> And wouldn't the same problems arise even without RowCoder? Is the path in
> that case to implement a custom coder?
>
> Regards,
> Anton
>
>
> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:
>
>> I'll summarize my input to the discussion. It is rather high level. But
>> IMO:
>>
>>  - even though schemas are part of Beam Java today, I think they should
>> become part of portability when ready
>>  - so each type in a schema needs a language-independent &
>> encoding-independent notion of domain of values and equality (so obviously
>> equal bytes are equal)
>>  - embedding in any language (hence Row in Java) must have a schema
>> type-driven equality that matches this spec
>>  - also each type (hence Row type) should have portable encoding(s) that
>> respect this equality so shuffling is consistent
>>  - Row in Java should be able to decode these encodings to different
>> underlying representations and change its strategy over time
>>
>> Kenn
>>
>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:
>>
>>> With adding BYTES type, we broke equality.
>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>>> hold for exotic types such as `Map`, `List`. The root
>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>> instead of structural.
>>>
>>> Before we jump into solution mode, let's state what we want to have:
>>> - *API* have stable API and be able to evolve efficient and use-case
>>> specific implementations without breaking it
>>> - *Correctness *we can't trade off correctness, a trivial
>>> implementation should work
>>> - *Performance *comparing equality is a fundamental operation, and we
>>> want to make it cheap
>>>
>>> 1. set `consistentWithEquals` if there is BYTES field
>>> Pros: almost no pros
>>> Cons: It would introduce a significant number of allocations when
>>> comparing rows, so we reject this option.
>>>
>>> 2. implement custom deep equals in `Row#equals`
>>> Pros: good performance, doesn't change API, `Row#equals` is correct
>>> Cons: doesn't work for `Map`, unless we roll own
>>> implementation
>>> Cons: it's possible to obtain `List` from `getValue()` that has
>>> broken equality, contains, etc, unless we roll own implementation
>>> Cons: non-trivial and requires ~200LOC to implement
>>>
>>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>>> StructuralByteArray)
>>> Pros: good performance and flexible to change how Java wrapper is
>>> implemented
>>> Pros: simple, doesn't require any specialized collections, no surprises,
>>> `Map` and `List` work.
>>> Cons: will change the return type of `Row#getValue`
>>>
>>> I want to suggest going with option #3. However, it isn't completely
>>> clear what wrapper we want to use, either it could be StructuralByteArray,
>>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>>> additional integer fields. StructuralByteArray doesn't have anything not
>>> necessary. One option would be adding `Row#getByteBuffer` that would be
>>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>>> override it for better performance, but `getValue(i)` must return
>>> StructuralByteArray.
>>>
>>> References:
>>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
>>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>>> https://github.com/apache/beam/pull/6765
>>>
>>> Gleb
>>>
>>

-- 
Cheers,
Gleb


Re: Fixing equality of Rows

2018-10-29 Thread Anton Kedin
About these specific use cases, how useful is it to support Map
and List? These seem pretty exotic (maybe they aren't) and I wonder
whether it would make sense to just reject them until we have a solid
design.

And wouldn't the same problems arise even without RowCoder? Is the path in
that case to implement a custom coder?

Regards,
Anton


On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:

> I'll summarize my input to the discussion. It is rather high level. But
> IMO:
>
>  - even though schemas are part of Beam Java today, I think they should
> become part of portability when ready
>  - so each type in a schema needs a language-independent &
> encoding-independent notion of domain of values and equality (so obviously
> equal bytes are equal)
>  - embedding in any language (hence Row in Java) must have a schema
> type-driven equality that matches this spec
>  - also each type (hence Row type) should have portable encoding(s) that
> respect this equality so shuffling is consistent
>  - Row in Java should be able to decode these encodings to different
> underlying representations and change its strategy over time
>
> Kenn
>
> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:
>
>> With adding BYTES type, we broke equality.
>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>> hold for exotic types such as `Map`, `List`. The root
>> cause is `byte[]`, where `equals` is implemented as reference equality
>> instead of structural.
>>
>> Before we jump into solution mode, let's state what we want to have:
>> - *API* have stable API and be able to evolve efficient and use-case
>> specific implementations without breaking it
>> - *Correctness *we can't trade off correctness, a trivial implementation
>> should work
>> - *Performance *comparing equality is a fundamental operation, and we
>> want to make it cheap
>>
>> 1. set `consistentWithEquals` if there is BYTES field
>> Pros: almost no pros
>> Cons: It would introduce a significant number of allocations when
>> comparing rows, so we reject this option.
>>
>> 2. implement custom deep equals in `Row#equals`
>> Pros: good performance, doesn't change API, `Row#equals` is correct
>> Cons: doesn't work for `Map`, unless we roll own implementation
>> Cons: it's possible to obtain `List` from `getValue()` that has
>> broken equality, contains, etc, unless we roll own implementation
>> Cons: non-trivial and requires ~200LOC to implement
>>
>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>> StructuralByteArray)
>> Pros: good performance and flexible to change how Java wrapper is
>> implemented
>> Pros: simple, doesn't require any specialized collections, no surprises,
>> `Map` and `List` work.
>> Cons: will change the return type of `Row#getValue`
>>
>> I want to suggest going with option #3. However, it isn't completely
>> clear what wrapper we want to use, either it could be StructuralByteArray,
>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>> additional integer fields. StructuralByteArray doesn't have anything not
>> necessary. One option would be adding `Row#getByteBuffer` that would be
>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>> override it for better performance, but `getValue(i)` must return
>> StructuralByteArray.
>>
>> References:
>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>> https://github.com/apache/beam/pull/6765
>>
>> Gleb
>>
>


Re: Fixing equality of Rows

2018-10-29 Thread Kenneth Knowles
I'll summarize my input to the discussion. It is rather high level. But IMO:

 - even though schemas are part of Beam Java today, I think they should
become part of portability when ready
 - so each type in a schema needs a language-independent &
encoding-independent notion of domain of values and equality (so obviously
equal bytes are equal)
 - embedding in any language (hence Row in Java) must have a schema
type-driven equality that matches this spec
 - also each type (hence Row type) should have portable encoding(s) that
respect this equality so shuffling is consistent
 - Row in Java should be able to decode these encodings to different
underlying representations and change its strategy over time

Kenn

On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:

> With adding BYTES type, we broke equality. `RowCoder#consistentWithEquals`
> is always true, but this property doesn't hold for exotic types such as
> `Map`, `List`. The root cause is `byte[]`, where `equals`
> is implemented as reference equality instead of structural.
>
> Before we jump into solution mode, let's state what we want to have:
> - *API* have stable API and be able to evolve efficient and use-case
> specific implementations without breaking it
> - *Correctness *we can't trade off correctness, a trivial implementation
> should work
> - *Performance *comparing equality is a fundamental operation, and we
> want to make it cheap
>
> 1. set `consistentWithEquals` if there is BYTES field
> Pros: almost no pros
> Cons: It would introduce a significant number of allocations when
> comparing rows, so we reject this option.
>
> 2. implement custom deep equals in `Row#equals`
> Pros: good performance, doesn't change API, `Row#equals` is correct
> Cons: doesn't work for `Map`, unless we roll own implementation
> Cons: it's possible to obtain `List` from `getValue()` that has
> broken equality, contains, etc, unless we roll own implementation
> Cons: non-trivial and requires ~200LOC to implement
>
> 3. wrapping byte[] into Java object with fixed equality (e.g.,
> StructuralByteArray)
> Pros: good performance and flexible to change how Java wrapper is
> implemented
> Pros: simple, doesn't require any specialized collections, no surprises,
> `Map` and `List` work.
> Cons: will change the return type of `Row#getValue`
>
> I want to suggest going with option #3. However, it isn't completely clear
> what wrapper we want to use, either it could be StructuralByteArray, or
> ByteBuffer. ByteBuffer is more standard. However, it comes with 4
> additional integer fields. StructuralByteArray doesn't have anything not
> necessary. One option would be adding `Row#getByteBuffer` that would be
> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
> override it for better performance, but `getValue(i)` must return
> StructuralByteArray.
>
> References:
> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
> https://github.com/apache/beam/pull/6765
>
> Gleb
>