[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649 ]

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:26 PM:

Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
* since there's no way of distinguishing a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as strings because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything else than a JSON object now that I think of it.

was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.
My current approach is the following:
* since there's no way of distinguishing a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything else than a JSON object now that I think of it.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.2.1
> Reporter: David Courtinot
> Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON
> data, and for a good reason: they are indistinguishable from structs in JSON
> format. In issue
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed
> to expose some methods of _InferSchema_ to users so that they can build on
> top of the inference primitives defined by this class. In this issue, I'm
> proposing to give more control to the user by letting them specify a set of
> fields that should be forced as _MapType_.
>
> *Use-case*
>
> Some JSON datasets contain high-cardinality fields, namely fields whose key
> space is very large. These fields shouldn't be interpreted as _StructType_
> for the following reasons:
> * it's not really what they are. The key space as well as the value space
> may both be infinite, so what best defines the schema of this data is the
> type of the keys and the type of the values, not a struct containing all
> possible key-value pairs.
> * interpreting high-cardinality fields as structs can lead to enormous
> schemata that don't even fit into memory.
>
> *Proposition*
>
> We would add a public overloaded signature for _InferSchema.inferField_ which
> allows passing a set of field accessors (a class that supports representing
> the access to any JSON field, including nested ones) for which we do
> not want to recurse and instead force a schema. That would make it possible,
> in particular, to ask that a few fields be inferred as maps
[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649 ]

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:24 PM:

Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
* since there's no way of distinguishing a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.

I should probably throw an exception when a field in the set is anything else than a JSON object now that I think of it.

was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.
My current approach is the following:
* since there's no way of distinguishing a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.
[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649 ]

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:22 PM:

Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
* since there's no way of distinguishing a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.

was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.
My current approach is the following:
* since there's now way to distinguish a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.
[jira] [Comment Edited] (SPARK-23520) Add support for MapType fields in JSON schema inference
[ https://issues.apache.org/jira/browse/SPARK-23520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382649#comment-16382649 ]

David Courtinot edited comment on SPARK-23520 at 3/1/18 9:22 PM:

Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.

My current approach is the following:
* since there's now way to distinguish a map from a struct in JSON, I allow the user to pass a set of optional fields (nested fields are supported) that they want to be inferred as maps
* I modified _inferField_ to track the current field path in order to compare it with the user-provided paths
* when I encounter an object, I infer all the _StructField_(s) as usual, but I also check whether the field path is included in the user-provided set. If it is, I reduce the _StructField_(s) to a single _DataType_ by calling _compatibleRootType_ on their _dataType()_
* I always infer the keys as maps because nothing else would make sense in JSON
* I added a clause in _compatibleType_ in order to merge the value types of two _MapType_. I check that both maps have _StringType_ as their key type.

was (Author: dicee):
Good catch. I think the issue is very similar indeed. However, it seems like the issue I created better outlines the pros of taking care of this problem.

For what it's worth, I should mention that I'm currently experimenting with my own Java (had to rewrite it for my team's codebase unfortunately) version which supports this feature. I hope to be able to propose a pull request in Scala soon-ish.

> Add support for MapType fields in JSON schema inference
> ---
>
> Key: SPARK-23520
> URL: https://issues.apache.org/jira/browse/SPARK-23520
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.2.1
> Reporter: David Courtinot
> Priority: Major
>
> _InferSchema_ currently does not support inferring _MapType_ fields from JSON
> data, and for a good reason: they are indistinguishable from structs in JSON
> format. In issue
> [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494], I proposed
> to expose some methods of _InferSchema_ to users so that they can build on
> top of the inference primitives defined by this class. In this issue, I'm
> proposing to give more control to the user by letting them specify a set of
> fields that should be forced as _MapType_.
>
> *Use-case*
>
> Some JSON datasets contain high-cardinality fields, namely fields whose key
> space is very large. These fields shouldn't be interpreted as _StructType_
> for the following reasons:
> * it's not really what they are. The key space as well as the value space
> may both be infinite, so what best defines the schema of this data is the
> type of the keys and the type of the values, not a struct containing all
> possible key-value pairs.
> * interpreting high-cardinality fields as structs can lead to enormous
> schemata that don't even fit into memory.
>
> *Proposition*
>
> We would add a public overloaded signature for _InferSchema.inferField_ which
> allows passing a set of field accessors (a class that supports representing
> the access to any JSON field, including nested ones) for which we do
> not want to recurse and instead force a schema. That would make it possible,
> in particular, to ask that a few fields be inferred as maps rather than structs.
>
> I am very open to discussing this with people who are better versed in the
> Spark codebase than me, because I realize my proposition can feel somewhat
> patchy. I'll be more than happy to provide some development effort if we
> manage to sketch a reasonably easy solution.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
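
The approach described in the comment (track the field path, force user-listed paths to a string-keyed map, merge map value types) can be illustrated with a small, self-contained sketch. This is plain Python rather than the Java or Scala code mentioned in the thread, and it is not Spark's API: the classes `Atomic`, `ArrayType`, `StructType` and `MapType`, the functions `compatible_type` and `infer_field`, and the `map_paths` parameter are simplified stand-ins invented for this example. Letting arrays pass through at a map path, so that their elements are checked instead, is also an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Simplified stand-ins for a DataType hierarchy (hypothetical, not Spark classes).
@dataclass(frozen=True)
class Atomic:
    name: str

@dataclass(frozen=True)
class ArrayType:
    element: Any

@dataclass(frozen=True)
class StructType:
    fields: Tuple[Tuple[str, Any], ...]  # (name, type) pairs sorted by name

@dataclass(frozen=True)
class MapType:
    key: Any
    value: Any

NULL, BOOLEAN, LONG, DOUBLE, STRING = (
    Atomic(n) for n in ("null", "boolean", "long", "double", "string"))


def compatible_type(a, b):
    """Return a type able to hold values of both a and b (string as fallback)."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if {a, b} == {LONG, DOUBLE}:
        return DOUBLE
    if isinstance(a, ArrayType) and isinstance(b, ArrayType):
        return ArrayType(compatible_type(a.element, b.element))
    if isinstance(a, StructType) and isinstance(b, StructType):
        merged = dict(a.fields)
        for name, dtype in b.fields:
            merged[name] = compatible_type(merged[name], dtype) if name in merged else dtype
        return StructType(tuple(sorted(merged.items())))
    # Extra clause from the comment: merge the value types of two string-keyed maps.
    if isinstance(a, MapType) and isinstance(b, MapType) and a.key == b.key == STRING:
        return MapType(STRING, compatible_type(a.value, b.value))
    return STRING


def infer_field(value, map_paths=frozenset(), path=()):
    """Infer a type for a parsed JSON value; `path` is the tuple of keys leading to it."""
    # A path requested as a map must hold a JSON object (arrays pass: elements are checked).
    if path in map_paths and not isinstance(value, (dict, list, type(None))):
        raise ValueError(f"{'.'.join(path)} was requested as a map but is not a JSON object")
    if value is None:
        return NULL
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return BOOLEAN
    if isinstance(value, int):
        return LONG
    if isinstance(value, float):
        return DOUBLE
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        element = NULL
        for item in value:
            element = compatible_type(element, infer_field(item, map_paths, path))
        return ArrayType(element)
    if isinstance(value, dict):
        # Infer every field as usual, extending the path with the field name.
        fields = {k: infer_field(v, map_paths, path + (k,)) for k, v in value.items()}
        if path in map_paths:
            # Reduce all field types to one value type; keys are always strings.
            value_type = NULL
            for dtype in fields.values():
                value_type = compatible_type(value_type, dtype)
            return MapType(STRING, value_type)
        return StructType(tuple(sorted(fields.items())))
    raise TypeError(f"unsupported JSON value: {value!r}")


# A made-up record with a high-cardinality field.
record = {"id": 7, "scores": {f"user_{i}": i for i in range(1000)}}
as_struct = infer_field(record)
as_map = infer_field(record, map_paths=frozenset({("scores",)}))
print(len(dict(as_struct.fields)["scores"].fields))  # 1000: one struct field per key
print(dict(as_map.fields)["scores"])                 # a single MapType, whatever the key count
```

With no paths supplied, the `scores` object becomes a struct with one field per distinct key, which is the schema growth the issue description warns about. With `("scores",)` in `map_paths`, the same object collapses to a single map type whose size does not depend on the number of keys.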