[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649 ]

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:26 PM:
-

Good catch. I think the issue is very similar indeed. However, it seems like 
the issue I created better outlines the benefits of addressing this problem. 
For what it's worth, I'm currently experimenting with my own Java version 
(I unfortunately had to rewrite it for my team's codebase) which supports this 
feature. I hope to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
 * since there's no way to distinguish a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path so that it can be 
compared with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but 
I also check whether the field path is included in the user-provided set. If 
it is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as strings because nothing else would make sense in 
JSON
 * I added a clause in _compatibleType_ to merge the value types of two 
_MapType_(s); I check that both maps have _StringType_ as their key type.

Now that I think of it, I should probably throw an exception when a field in 
the set is anything other than a JSON object.
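
To make this concrete, here is a rough, self-contained Scala sketch of the 
approach (not my actual patch: it uses a toy JSON AST instead of Jackson, and 
_forcedMapPaths_ is an illustrative name; only _inferField_, _compatibleType_ 
and _compatibleRootType_ exist in Spark's _InferSchema_):

{code:scala}
import org.apache.spark.sql.types._

// Toy JSON AST standing in for Jackson's streaming parser.
sealed trait JValue
case class JString(s: String) extends JValue
case class JLong(l: Long) extends JValue
case class JObject(fields: Map[String, JValue]) extends JValue

object MapAwareInference {

  /** Infer a type, forcing fields whose path is in `forcedMapPaths`
    * (e.g. Seq("a", "b") for the nested field a.b) to be maps. */
  def inferField(value: JValue, path: Seq[String],
                 forcedMapPaths: Set[Seq[String]]): DataType = value match {
    case JString(_) => StringType
    case JLong(_)   => LongType
    case JObject(fields) =>
      val structFields = fields.toSeq.map { case (k, v) =>
        StructField(k, inferField(v, path :+ k, forcedMapPaths))
      }
      if (forcedMapPaths.contains(path)) {
        // Reduce the StructFields to a single value type; keys are always
        // strings because nothing else would make sense in JSON.
        val valueType = structFields.map(_.dataType)
          .reduceOption(compatibleType)
          .getOrElse(StringType)
        MapType(StringType, valueType)
      } else {
        StructType(structFields.sortBy(_.name))
      }
  }

  /** Merge two types. The MapType clause is the one I added: it requires
    * both maps to be string-keyed. Incompatible types fall back to
    * StringType, as Spark's inference does. */
  def compatibleType(t1: DataType, t2: DataType): DataType = (t1, t2) match {
    case (a, b) if a == b => a
    case (MapType(StringType, v1, n1), MapType(StringType, v2, n2)) =>
      MapType(StringType, compatibleType(v1, v2), n1 || n2)
    case (StructType(f1), StructType(f2)) =>
      val merged = (f1 ++ f2).groupBy(_.name).toSeq.sortBy(_._1).map {
        case (name, fs) => StructField(name, fs.map(_.dataType).reduce(compatibleType))
      }
      StructType(merged)
    case _ => StringType
  }
}
{code}

With _forcedMapPaths_ = Set(Seq("attributes")), an "attributes" object with 
thousands of distinct keys infers as a string-keyed map instead of an 
enormous struct.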


[jira] [Created] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-02-26 Thread David Courtinot (JIRA)
David Courtinot created SPARK-23520:
---

 Summary: Add support for MapType fields in JSON schema inference
 Key: SPARK-23520
 URL: https://issues.apache.org/jira/browse/SPARK-23520
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.1
Reporter: David Courtinot


_InferSchema_ currently does not support inferring _MapType_ fields from JSON 
data, and for a good reason: they are indistinguishable from structs in JSON 
format. In issue 
[SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed to 
expose some methods of _InferSchema_ to users so that they can build on top of 
the inference primitives defined by this class. In this issue, I'm proposing 
to give the user more control by letting them specify a set of fields that 
should be forced as _MapType._

*Use-case*

Some JSON datasets contain high-cardinality fields, namely fields whose key 
space is very large. These fields shouldn't be interpreted as _StructType_ for 
the following reasons:
 * it's not really what they are. The key space and the value space may both 
be unbounded, so what best defines the schema of this data is the type of the 
keys and the type of the values, not a struct enumerating all possible 
key-value pairs.
 * interpreting high-cardinality fields as structs can lead to enormous 
schemata that don't even fit into memory.

*Proposition*

We would add a public overloaded signature for _InferSchema.inferField_ that 
accepts a set of field accessors (a class representing access to any JSON 
field, including nested ones) for which we do not want to recurse, forcing a 
user-supplied schema instead. That would allow, in particular, asking that a 
few fields be inferred as maps rather than structs.
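
To illustrate, a sketch of what the proposed overload could look like (the 
names and shapes below, including _FieldAccessor_, are purely illustrative, 
not an actual Spark signature):

{code:scala}
import org.apache.spark.sql.types.{DataType, MapType, StringType}

/** Represents access to any JSON field, including nested ones,
  * e.g. FieldAccessor("user", "attributes") for user.attributes. */
case class FieldAccessor(path: String*)

object ProposedApi {
  /** Proposed public overload: do not recurse into the fields listed in
    * `forced`; use the supplied type for them instead. */
  def inferField(record: String, forced: Map[FieldAccessor, DataType]): DataType =
    ??? // would delegate to the existing private inference

  // Example: force a high-cardinality field to be a string-keyed map.
  val example = Map(FieldAccessor("user", "attributes") -> MapType(StringType, StringType))
}
{code}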

I am very open to discussing this with people who are better versed in the 
Spark codebase than I am, because I realize my proposition can feel somewhat 
patchy. I'll be more than happy to provide some development effort if we 
manage to sketch a reasonably easy solution.






[jira] [Updated] (SPARK-23494) Expose InferSchema's functionalities to the outside

2018-02-23 Thread David Courtinot (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Courtinot updated SPARK-23494:

Description: 
I'm proposing that InferSchema's internals (infer the schema of each record, 
merge two schemata, and canonicalize the result) be exposed to the outside.

*Use-case*

My team continuously produces large amounts of JSON data. The schema is, and 
must be, very dynamic: fields can appear and disappear from one day to the 
next, most fields are nullable, some fields appear with very low frequency, 
etc.

In another job, we download this data, sample it and infer the schema using 
Dataset.schema(). From there, we convert the data to Parquet and upload it 
somewhere for later querying (roughly the pipeline sketched after the list 
below). This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the 
schema. This results in exceptions when trying to query those fields. We have 
had to implement cumbersome fixes for this involving a manually curated set of 
required fields.
 * this is expensive. Going through a sample of the data to infer the schema 
is still a very costly operation for us. Caching the JSON RDD to disk (it 
doesn't fit in memory) proved to be at least as slow as traversing the sample 
first and the whole data next.
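
For reference, here is roughly that pipeline as a spark-shell sketch (paths 
and the sampling fraction are illustrative; reading JSON from a 
Dataset[String] requires Spark 2.2+):

{code:scala}
val raw = spark.read.textFile("s3://bucket/raw/2018-03-01/*.json")  // Dataset[String]
val sample = raw.sample(withReplacement = false, fraction = 0.01)

// First pass: infer a schema from the sample. Rare fields may be missing.
val schema = spark.read.json(sample).schema

// Second pass: re-read the full data with that schema and convert to Parquet.
spark.read.schema(schema).json(raw)
  .write.parquet("s3://bucket/parquet/2018-03-01")
{code}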

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can 
easily be built on top of it in order to calculate a schema alongside an RDD 
calculation. In the above use-case, this has three main advantages:
 * the schema is inferred from the entire data, and therefore contains all 
possible fields, no matter how low their frequency is.
 * the computational overhead is negligible, since the inference happens at 
the same time as writing the data to an external store rather than by 
evaluating the RDD for the sole purpose of schema inference.
 * after writing the schema to an external store, we can load the JSON data in 
a Dataset without ever paying the inference cost again (just the conversion 
from JSON to Row). We keep the advantages and flexibility of JSON while also 
benefiting from the powerful features and optimizations available on Datasets 
or Parquet itself.

With such a feature, users can treat their JSON (or other) data as structured 
data whenever they want, even though the actual schema may vary every ten 
minutes, as long as they record the schema of each portion of data.
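
A minimal sketch of such an accumulator, assuming _InferSchema_ exposed a 
per-record inference function and a commutative, associative type merge (both 
are private today, so the sketch takes them as plain function parameters):

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}
import org.apache.spark.util.AccumulatorV2

class SchemaAccumulator(
    inferRecord: String => DataType,
    mergeTypes: (DataType, DataType) => DataType,
    private var schema: DataType = new StructType())
  extends AccumulatorV2[String, DataType] {

  override def isZero: Boolean = schema == new StructType()
  override def copy(): SchemaAccumulator =
    new SchemaAccumulator(inferRecord, mergeTypes, schema)
  override def reset(): Unit = schema = new StructType()

  // Fold each record's inferred schema into the running one...
  override def add(record: String): Unit =
    schema = mergeTypes(schema, inferRecord(record))

  // ...and merge partial schemata across tasks. Schema merging is
  // idempotent, so task retries cannot corrupt the result.
  override def merge(other: AccumulatorV2[String, DataType]): Unit =
    schema = mergeTypes(schema, other.value)

  override def value: DataType = schema
}
{code}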


[jira] [Created] (SPARK-23494) Expose InferSchema's functionalities to the outside

2018-02-23 Thread David Courtinot (JIRA)
David Courtinot created SPARK-23494:
---

 Summary: Expose InferSchema's functionalities to the outside
 Key: SPARK-23494
 URL: https://issues.apache.org/jira/browse/SPARK-23494
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.1
Reporter: David Courtinot


I'm proposing that InferSchema's internals (infer the schema of each record, 
merge two schemata, and canonicalize the result) be exposed to the outside.

*Use-case*

We continuously produce large amounts of JSON data. The schema is, and must 
be, very dynamic: fields can appear and disappear from one day to the next, 
most fields are nullable, some fields appear with very low frequency, etc.

We then consume this data, sample it, and infer the schema using 
Dataset.schema(). From there, we output the data as Parquet for later 
querying. This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the 
schema. This results in exceptions when trying to query those fields. We have 
had to implement cumbersome fixes for this involving a manually curated set of 
required fields.
 * this is expensive. Going through a sample of the data to infer the schema 
is still a very costly operation for us. Caching the JSON RDD to disk (it 
doesn't fit in memory) proved at least as slow as traversing the sample first 
and the whole data next.

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can 
easily be built on top of it in order to calculate a schema alongside an RDD 
calculation. In the above use-case, this has three main advantages:
 * the schema is inferred from the entire data, and therefore contains all 
possible fields.
 * the computational overhead is negligible, since the inference happens at 
the same time as writing the data to an external store rather than by 
evaluating the RDD for the sole purpose of schema inference.
 * after writing the manifest to an external store, we can load the JSON data 
in a Dataset without ever paying the inference cost again (just the conversion 
from JSON to Row).

With such a feature, users can treat their JSON (or whatever else) data as 
structured data whenever they want, even though the actual schema may vary 
every ten minutes, as long as they record the schema of each portion of data.
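
In usage, the schema falls out of the same pass that writes the data. 
Continuing the _SchemaAccumulator_ sketch from the update above (the two 
primitives remain placeholders for what _InferSchema_ would expose):

{code:scala}
import org.apache.spark.sql.types.{DataType, StringType}

// Placeholder primitives; real code would call into InferSchema here.
val inferRecord: String => DataType = _ => StringType
val mergeTypes: (DataType, DataType) => DataType = (a, _) => a

val acc = new SchemaAccumulator(inferRecord, mergeTypes)
spark.sparkContext.register(acc, "jsonSchema")

spark.sparkContext.textFile("s3://bucket/raw/*.json")
  .map { line => acc.add(line); line }      // infer as a side effect
  .saveAsTextFile("s3://bucket/raw-copy")   // any action that reads every record

val fullSchema = acc.value  // contains even the rarest fields
{code}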



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org