[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema
[ https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119501#comment-16119501 ]

Jochen Niebuhr commented on SPARK-21651:
----------------------------------------

Specifying the schema myself would mean I'd have to change it every time a new field appears. With the current implementation, you could write some schema to JSON with Spark and it would read back a different schema, or fail to read it at all if you're using Maps. We could add some flag which activates this feature, but I think this might be helpful for some people.

> Detect MapType in Json InferSchema
> ----------------------------------
>
>                 Key: SPARK-21651
>                 URL: https://issues.apache.org/jira/browse/SPARK-21651
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1, 2.2.0
>            Reporter: Jochen Niebuhr
>            Priority: Minor
>
> When loading JSON files which include a map with highly variable keys, the
> current schema inference logic might create a very large schema. This will lead
> to long load times and possibly out-of-memory errors.
> I've already submitted a pull request to the MongoDB Spark driver, which had the
> same problem. Should I port this logic over to the JSON schema inference class?
> The MongoDB Spark pull request mentioned is:
> https://github.com/mongodb/mongo-spark/pull/24

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema
[ https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119484#comment-16119484 ]

Takeshi Yamamuro commented on SPARK-21651:
------------------------------------------

Is specifying a schema yourself not enough for your case?
{code}
json data:
{"id": "0001", "relations": {"r1": {"x": "7", "z": "1"}, "r2": {"y": "8"}}}

scala> spark.read.json("xxx").printSchema
root
 |-- id: string (nullable = true)
 |-- relations: struct (nullable = true)
 |    |-- r1: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- z: string (nullable = true)
 |    |-- r2: struct (nullable = true)
 |    |    |-- y: string (nullable = true)

scala> spark.read.schema("id STRING, relations MAP<STRING, STRUCT<x: STRING, y: STRING, z: STRING>>").json("xxx").printSchema
root
 |-- id: string (nullable = true)
 |-- relations: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)
{code}
I feel this optimization is a little domain-specific; does any other runtime support this kind of optimization?
[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema
[ https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119450#comment-16119450 ]

Jochen Niebuhr commented on SPARK-21651:
----------------------------------------

If you have an entity in JSON objects which stores some relations to other entities, it might look like this:
{code}
{
  "id": "06d32281-db4d-4d47-911a-c0b59cc0ed26",
  "relations": {
    "38401db2-1036-499f-b21e-e9be532cddb2": { /* ... some relation content ... */ },
    "1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ }
  }
}
{
  "id": "38401db2-1036-499f-b21e-e9be532cddb2",
  "relations": {
    "06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ },
    "1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ }
  }
}
{
  "id": "1cbb297c-cec8-4288-9edc-9d4b5dad3eec",
  "relations": {
    "06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ },
    "38401db2-1036-499f-b21e-e9be532cddb2": { /* ... */ }
  }
}
{code}
If I put that JSON through the JSON schema inference step, it will generate a schema like this:
{code}
Struct<id: String, relations: Struct<38401db2-1036-499f-b21e-e9be532cddb2: Struct<...>, 1cbb297c-cec8-4288-9edc-9d4b5dad3eec: Struct<...>, 06d32281-db4d-4d47-911a-c0b59cc0ed26: Struct<...>>>
{code}
If I do this with a sample of 100,000 documents, the schema will become very large and probably crash my job, or at least make it take forever. But since everything in "relations" shares the same key and value types, I could just say "relations" is a MapType. My schema wouldn't grow as large and I could simply query it.

So the expected schema would be:
{code}
Struct<id: String, relations: Map<String, Struct<...>>>
{code}
In the version I implemented in the MongoDB driver, this behavior has the following requirements:
* Over 250 keys in a single Struct
* All value types are compatible
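The two requirements above can be sketched as a small heuristic. This is a hypothetical simplification, not the actual mongo-spark code: it models Spark's type hierarchy with a toy one and treats value types as "compatible" only when they are identical, whereas real inference would merge compatible struct fields.

```scala
// Hypothetical sketch of the struct-to-map collapse described above.
object MapTypeHeuristic {
  // Toy stand-in for Spark's DataType hierarchy (assumption, not Spark's API).
  sealed trait DataType
  case object StringType extends DataType
  final case class StructType(fields: Map[String, DataType]) extends DataType
  final case class MapType(key: DataType, value: DataType) extends DataType

  // "Over 250 keys in a single Struct" (from the comment above).
  val MapKeyThreshold = 250

  // Collapse a struct into a map when it has many keys and all value
  // types are compatible (here simplified to: identical).
  def asMapType(struct: StructType): Option[MapType] = {
    val valueTypes = struct.fields.values.toSet
    if (struct.fields.size > MapKeyThreshold && valueTypes.size == 1)
      Some(MapType(StringType, valueTypes.head))
    else
      None
  }
}
```

With this sketch, a struct of 300 UUID keys whose values all share one type collapses to `MapType(StringType, <valueType>)`, while a small or heterogeneous struct stays a struct.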
[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema
[ https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119266#comment-16119266 ]

Hyukjin Kwon commented on SPARK-21651:
--------------------------------------

Yea, showing the expected input/output would be nicer. How would we distinguish struct from map in schema inference?
[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema
[ https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119158#comment-16119158 ]

Takeshi Yamamuro commented on SPARK-21651:
------------------------------------------

Could you give a concrete example?