[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference

2018-03-01 Thread David Courtinot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649
 ] 

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:26 PM:
-

Good catch, the two issues are indeed very similar. However, I think the one I 
created better outlines the benefits of addressing this problem. For what it's 
worth, I'm currently experimenting with my own Java version (I unfortunately had 
to rewrite it for my team's codebase), which supports this feature, and I hope 
to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
 * since there's no way of distinguishing a map from a struct in JSON, I allow 
the user to pass a set of optional fields (nested fields are supported) that 
they want to be inferred as maps
 * I modified _inferField_ to track the current field path in order to compare 
it with the user-provided paths
 * when I encounter an object, I infer all the _StructField_(s) as usual, but I 
also check whether the field path is included in the user-provided set. If it 
is, I reduce the _StructField_(s) to a single _DataType_ by calling 
_compatibleRootType_ on their _dataType()_
 * I always infer the keys as strings because nothing else would make sense in 
JSON
 * I added a clause in _compatibleType_ to merge the value types of two 
_MapType_ instances, checking that both maps have _StringType_ as their key type.

Now that I think of it, I should probably throw an exception when a field in the 
set is anything other than a JSON object. A rough sketch of the map-collapsing 
step is given below.
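
To make this concrete, here is a minimal, hedged Scala sketch of the 
map-collapsing step. This is not the actual _InferSchema_ code: _mergeValueTypes_ 
below is a deliberately simplified stand-in for _compatibleType_ / 
_compatibleRootType_, and _resolveObjectType_ only illustrates how a 
user-provided set of field paths could force a _MapType_.

{code:scala}
import org.apache.spark.sql.types._

// Sketch only: mergeValueTypes is a simplified stand-in for the reconciliation
// done by compatibleType/compatibleRootType (the real rules also handle numeric
// widening, nullability, etc.).
object ForcedMapInference {

  // Coarse merge of two value types: keep identical types, merge nested
  // string-keyed maps recursively, fall back to StringType otherwise.
  private def mergeValueTypes(a: DataType, b: DataType): DataType = (a, b) match {
    case (x, y) if x == y => x
    case (MapType(StringType, v1, n1), MapType(StringType, v2, n2)) =>
      MapType(StringType, mergeValueTypes(v1, v2), n1 || n2)
    case _ => StringType
  }

  // Collapse the fields of an inferred struct into a single MapType.
  // JSON object keys are always strings, so the key type is fixed to StringType.
  def collapseToMap(inferred: StructType): MapType = {
    val valueType =
      if (inferred.fields.isEmpty) StringType // no keys observed in the samples
      else inferred.fields.map(_.dataType).reduce(mergeValueTypes)
    MapType(StringType, valueType, valueContainsNull = true)
  }

  // For a JSON object found at `path` (e.g. Seq("payload", "headers")), keep the
  // inferred struct unless the user asked for that path to be forced to a map.
  def resolveObjectType(
      path: Seq[String],
      inferred: StructType,
      forcedMapPaths: Set[Seq[String]]): DataType =
    if (forcedMapPaths.contains(path)) collapseToMap(inferred) else inferred
}
{code}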


> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: David Courtinot
>Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON 
> data, and for a good reason: they are indistinguishable from structs in JSON 
> format. In issue 
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed 
> exposing some methods of _InferSchema_ to users so that they can build on top 
> of the inference primitives defined by this class. In this issue, I'm proposing 
> to give users more control by letting them specify a set of fields that should 
> be forced to _MapType_.
> *Use-case*
> Some JSON datasets contain high-cardinality fields, i.e. fields whose key 
> space is very large. Such fields shouldn't be interpreted as _StructType_, for 
> the following reasons:
>  * it's not really what they are. The key space as well as the value space 
> may both be infinite, so what best defines the schema of this data is the 
> type of the keys and the type of the values, not a struct containing all 
> possible key-value pairs.
>  * interpreting high-cardinality fields as structs can lead to enormous 
> schemata that don't even fit into memory.
> *Proposition*
> We would add a public overloaded signature for _InferSchema.inferField_ that 
> allows passing a set of field accessors (a class that can represent the access 
> to any JSON field, including nested ones) for which we do not want to recurse 
> and instead force a schema. That would allow, in particular, asking that a few 
> fields be inferred as maps rather than structs.
> I am very open to discussing this with people who are more well-versed in the 
> Spark codebase than I am, because I realize my proposition can feel somewhat 
> patchy. I'll be more than happy to provide some development effort if we manage 
> to sketch a reasonably easy solution.
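
For illustration only, here is one possible shape for the proposed overload. 
Neither _FieldPath_ nor these signatures exist in Spark today; they are only 
assumptions about what the API could look like.

{code:scala}
import org.apache.spark.sql.types.StructType

// Made-up names for illustration: neither FieldPath nor this trait exists in Spark.
final case class FieldPath(segments: Seq[String]) {
  def child(name: String): FieldPath = FieldPath(segments :+ name)
  override def toString: String = segments.mkString(".")
}

trait MapAwareSchemaInference {
  // Existing-style entry point: infer a schema from sample JSON records.
  def infer(jsonRecords: Iterator[String]): StructType

  // Proposed overload: any object field whose path appears in forceAsMap is
  // inferred as MapType(StringType, mergedValueType) instead of a StructType.
  def infer(jsonRecords: Iterator[String], forceAsMap: Set[FieldPath]): StructType
}
{code}

A caller would then write something like infer(records, 
Set(FieldPath(Seq("payload", "headers")))), where "payload.headers" is just an 
example path.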



