[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema

2017-08-08 Thread Jochen Niebuhr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119501#comment-16119501
 ] 

Jochen Niebuhr commented on SPARK-21651:


Specifying the schema myself would mean I'd have to change it every time a new 
field appears.
With the current implementation, you can write a DataFrame with a map column to 
JSON with Spark, and on read it will infer a different schema, or not be able 
to read the data at all.
We could add a flag which activates this feature; I think it would be helpful 
for some people.
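
To illustrate the round trip I mean, here is a minimal sketch (Scala shell; the 
path and column names are just examples): a DataFrame with a map column is 
written to JSON, and reading it back without an explicit schema infers a struct 
whose fields are whatever keys happened to occur in the data.
{code}
import org.apache.spark.sql.functions.map
import spark.implicits._

// A DataFrame with a genuine MapType column.
val df = Seq(("0001", "r1", "x"), ("0002", "r2", "y"))
  .toDF("id", "relKey", "relValue")
  .select($"id", map($"relKey", $"relValue").as("relations"))

df.printSchema()   // relations: map<string, string>
df.write.json("/tmp/relations-json")

// Reading the same files back: inference turns the map into a struct with one
// field per key that appeared anywhere in the data (r1, r2, ...).
spark.read.json("/tmp/relations-json").printSchema()
{code}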

> Detect MapType in Json InferSchema
> --
>
> Key: SPARK-21651
> URL: https://issues.apache.org/jira/browse/SPARK-21651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Jochen Niebuhr
>Priority: Minor
>
> When loading JSON files which include a map with highly variable keys, the 
> current schema inference logic might create a very large schema. This will 
> lead to long load times and possibly out-of-memory errors. 
> I've already submitted a pull request to the MongoDB Spark driver, which had 
> the same problem. Should I port this logic over to the JSON schema inference 
> class?
> The MongoDB Spark pull request mentioned is: 
> https://github.com/mongodb/mongo-spark/pull/24



[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema

2017-08-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119484#comment-16119484
 ] 

Takeshi Yamamuro commented on SPARK-21651:
--

Isn't specifying a schema yourself enough for your case?
{code}
json data:
{"id": "0001", "relations": {"r1": {"x": "7", "z": "1"}, "r2": {"y": "8"}}}

scala> spark.read.json("xxx").printSchema
root
 |-- id: string (nullable = true)
 |-- relations: struct (nullable = true)
 |    |-- r1: struct (nullable = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- z: string (nullable = true)
 |    |-- r2: struct (nullable = true)
 |    |    |-- y: string (nullable = true)

scala> spark.read.schema("id STRING, relations MAP<STRING, STRUCT<x: STRING, y: STRING, z: STRING>>").json("xxx").printSchema
root
 |-- id: string (nullable = true)
 |-- relations: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- x: string (nullable = true)
 |    |    |-- y: string (nullable = true)
 |    |    |-- z: string (nullable = true)
{code}
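
For reference, roughly the same schema could also be built programmatically with 
the StructType/MapType API instead of the DDL string (a sketch; the field names 
follow the example above):
{code}
import org.apache.spark.sql.types._

// Map values are structs carrying the union of the observed relation fields.
val relationType = StructType(Seq(
  StructField("x", StringType),
  StructField("y", StringType),
  StructField("z", StringType)))

val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("relations", MapType(StringType, relationType))))

spark.read.schema(schema).json("xxx").printSchema
{code}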

I feel this optimization is a little domain-specific; is there any other runtime 
that supports this kind of optimization?


[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema

2017-08-08 Thread Jochen Niebuhr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119450#comment-16119450
 ] 

Jochen Niebuhr commented on SPARK-21651:


If you have an entity in JSON objects which stores relations to other entities, 
it might look like this:
{code}
{ "id": "06d32281-db4d-4d47-911a-c0b59cc0ed26", "relations": { 
"38401db2-1036-499f-b21e-e9be532cddb2": { /* ... some relation content ... */ 
}, "1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ } } }
{ "id": "38401db2-1036-499f-b21e-e9be532cddb2", "relations": { 
"06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ }, 
"1cbb297c-cec8-4288-9edc-9d4b5dad3eec": { /* ... */ } } }
{ "id": "1cbb297c-cec8-4288-9edc-9d4b5dad3eec", "relations": { 
"06d32281-db4d-4d47-911a-c0b59cc0ed26": { /* ... */ }, 
"38401db2-1036-499f-b21e-e9be532cddb2": { /* ... */ } } }
{code}

If I put that JSON through the JSON schema inference step, it will generate a 
schema like this:
{code}
Struct<
  id: String,
  relations: Struct<
    38401db2-1036-499f-b21e-e9be532cddb2: Struct<...>,
    1cbb297c-cec8-4288-9edc-9d4b5dad3eec: Struct<...>,
    06d32281-db4d-4d47-911a-c0b59cc0ed26: Struct<...>
  >
>
{code}

If I do this with a sample of 100,000 documents, the schema becomes very large 
and will probably crash my job, or at least take forever. But since everything 
in relations shares the same key and value types, I could just say relations is 
a MapType. The schema wouldn't grow that large and I could simply query it.

So the expected schema would be:
{code}Struct<id: String, relations: Map<String, Struct<...>>>{code}

In the version I implemented in the MongoDB driver, this behavior has the 
following requirements (a rough sketch of the heuristic follows below):
* Over 250 keys in a single Struct
* All value types are compatible
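
For illustration only, here is a hypothetical sketch of what that collapsing 
heuristic could look like on top of Spark's StructType/MapType (this is not the 
actual mongo-spark PR code; the 250-key threshold and the compatibility check 
just mirror the two requirements above, and the sketch simplifies "compatible" 
to "identical"):
{code}
import org.apache.spark.sql.types._

// Collapse a struct with many keys of one common value type into a map.
// A real implementation would merge compatible types (e.g. int/long) the same
// way schema inference already does, and would apply this recursively while
// folding the per-record schemas together.
def collapseToMap(struct: StructType, keyThreshold: Int = 250): DataType = {
  val valueTypes = struct.fields.map(_.dataType).distinct
  if (struct.fields.length > keyThreshold && valueTypes.length == 1) {
    MapType(StringType, valueTypes.head, valueContainsNull = true)
  } else {
    struct
  }
}
{code}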


[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema

2017-08-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119266#comment-16119266
 ] 

Hyukjin Kwon commented on SPARK-21651:
--

Yeah, showing the expected input / output would be nicer. How would we 
distinguish struct vs. map during schema inference?


[jira] [Commented] (SPARK-21651) Detect MapType in Json InferSchema

2017-08-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119158#comment-16119158
 ] 

Takeshi Yamamuro commented on SPARK-21651:
--

Could you give a concrete example?
