Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Michael Armbrust
>
> Additionally, shall I go ahead and open a ticket pointing out the missing
> call to .asNullable in the streaming reader?
>

Yes please! This probably affects correctness.


Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Aleksander Eskilson
I've opened the JIRA ticket you requested earlier:
https://issues.apache.org/jira/browse/SPARK-17939


Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Aleksander Eskilson
Interesting. I'm quite glad to read your explanation; it makes some of our
work quite a bit clearer. I'll open a ticket in a similar vein to this
discussion: https://github.com/apache/spark/pull/11785, contrasting
nullability's implementation as an optimization versus an enforcement
mechanism.

Additionally, shall I go ahead and open a ticket pointing out the missing
call to .asNullable in the streaming reader?

Thanks,
Alek


Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-13 Thread Michael Armbrust
There is a lot of confusion around nullable in StructType, and we should
definitely come up with a consistent story and make sure we have better
documentation. This might even mean deprecating this field. At a high
level, I think the key problem is that internally, nullable is used as an
optimization, not an enforcement mechanism. This is a lot different than
NOT NULL in a traditional database. Specifically, we take "nullable =
false" as a promise that the column will never be null and use that fact
to do optimizations like skipping null checks. This means that if you lie
to us, you can actually get wrong answers (e.g. 0 instead of null). This
is clearly confusing and not ideal.
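
For illustration, a minimal sketch of that failure mode (the session
setup, names, and exact downstream behaviour are illustrative, not from
this thread; results vary by Spark version and query plan):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // One row really does contain a null, but the schema promises it can't.
    val rows = spark.sparkContext.parallelize(Seq(Row("a", null), Row("b", 2)))
    val lyingSchema = StructType(Seq(
      StructField("key",   StringType,  nullable = true),
      StructField("value", IntegerType, nullable = false)  // the lie
    ))

    val df = spark.createDataFrame(rows, lyingSchema)
    // Because nullable = false licenses Catalyst to skip null checks,
    // expressions over "value" may yield a default (e.g. 0) instead of
    // null rather than failing cleanly.
    df.select(col("value") + 1).show()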

A little bit of explanation for some of the specific cases you brought up:
the reason we call asNullable on file sources is that even if a column is
never null in the data, there are cases that can still produce a null
result. For example, when parsing JSON, we null out all columns other than
_corrupt_record when we fail to parse a line. The fact that it's different
in streaming is a bug.
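
For instance, a sketch of the JSON case (assumes the default PERMISSIVE
parse mode and a Spark version whose DataFrameReader.json accepts a
Dataset[String]):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Two JSON lines; the second is malformed and fails to parse.
    val lines = Seq(
      """{"id": 1, "name": "ok"}""",
      """{"id": 2, "name": """
    ).toDS()

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("_corrupt_record", StringType)
    ))

    // The malformed line comes back with id and name nulled out and the
    // raw text in _corrupt_record -- so a column that is never null in
    // valid data can still produce nulls.
    spark.read.schema(schema).json(lines).show(truncate = false)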

Would you mind opening a JIRA ticket so we can discuss the right path
forward there?


DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-13 Thread Aleksander Eskilson
Hi there,

Working in the space of custom Encoders/ExpressionEncoders, I've noticed
that the StructType schema set when creating an object of the
ExpressionEncoder[T] class [1] is not the schema actually used to set the
column types of a Dataset created via the .as(encoder) method [2] on read
data. Instead, the schema is either inferred through analysis of the data,
or provided using the .schema(structType) method [3] of the
DataFrameReader. However, when using the .schema(..) method of the
DataFrameReader, potentially undesirable behaviour occurs: while the
DataSource is being resolved, all StructFields of the StructType schema
have their nullability set to *true* (via the asNullable function of
StructType) [4] whenever the data is read from a non-streaming,
file-based source, as opposed to a streaming one.

Of course, allowing null values where they shouldn't exist can weaken the
type guarantees for Datasets over certain types of encoded data.
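
To make the report concrete, a sketch of the observed behaviour (the case
class, path, and format here are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    case class Record(key: String, value: Int)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // The caller asks for non-nullable columns...
    val strictSchema = StructType(Seq(
      StructField("key",   StringType,  nullable = false),
      StructField("value", IntegerType, nullable = false)
    ))

    val ds = spark.read
      .schema(strictSchema)
      .json("/path/to/records")  // hypothetical path
      .as[Record]

    // ...but after the non-streaming file source is resolved, every
    // field reports nullable = true: asNullable superseded the schema.
    ds.printSchema()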

Thinking on how this might be resolved: first, if it's a legitimate bug,
I'm not sure why "non-streaming file based" datasources need to have
their StructFields all rendered nullable; simply removing the call to
asNullable would fix the issue. Second, if it's actually necessary for
most filesystem-read datasources to have their StructFields rendered
nullable in this manner, we could instead let the StructType schema
provided to the Encoder have the final say in the Dataset's schema.
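
As a sketch of what giving the encoder schema the final say could look
like from user code today (hypothetical names; this leans on
createDataFrame preserving the nullability flags of the schema it is
given, which appears to hold in current Spark versions):

    import org.apache.spark.sql.{Encoders, SparkSession}

    case class Record(key: String, value: Int)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val encoder = Encoders.product[Record]
    val strictSchema = encoder.schema  // value: nullable = false

    // Read as today: the resolved schema comes back all-nullable.
    val df = spark.read.schema(strictSchema).json("/path/to/records")

    // Rebuild the frame against the encoder's schema so the encoder's
    // intent wins again.
    val reasserted = spark.createDataFrame(df.rdd, strictSchema).as[Record](encoder)
    reasserted.printSchema()  // nullability restored from the encoder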

This latter option seems sensible to me: if a client is willing to provide
a custom Encoder via the .as(..) method on the reader, then presumably, in
setting the schema field of the encoder, they have some legitimate notion
of how their object's types should map to Dataset column types. Any
failure when resolving their data to a Dataset by means of their Encoder
can then be traced to their Encoder for their own debugging.

Thoughts? Thanks,
Alek Eskilson

[1] -
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L213
[2] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L374
[3] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L62
[4] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L426