Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Michael Armbrust
On Sun, Oct 16, 2016 at 3:50 AM,  wrote:

> Think of it as jsonl instead of a json file.
> Point people at this if they need an official looking spec:
> http://jsonlines.org/
>

That link is awesome.  I think it would be great if someone could open a PR
to add this to our documentation.

I'd also be happy to add a flag to support multiline objects at file
boundaries, but someone needs to propose a scalable way to do it.  While that
blog post is a good resource, it would easily cause OOMs on large files.


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread WangJianfei
Thank you!
But I think it's user-unfriendly to process a standard JSON file with
DataFrame. Should we provide a new overloaded method to do this?






Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread trsell
Think of it as jsonl instead of a json file.
Point people at this if they need an official looking spec:
http://jsonlines.org/

One good reason for using this format is that you can split mid-file easily.
This makes it work well with standard Unix tools in pipes.
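
A minimal sketch of the splitting point, in plain Python rather than Spark.
The sample records and the helper names parse_jsonl and split_at_newline are
made up for illustration. Because every line is a complete JSON value, the
text can be cut at any newline and each piece parsed on its own:

```python
import json

# Hypothetical sample data: three records in JSON Lines form.
jsonl_text = (
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "name": "b"}\n'
    '{"id": 3, "name": "c"}\n'
)


def parse_jsonl(text):
    """Parse each non-empty line as an independent JSON value."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_at_newline(text, approx_offset):
    """Cut text at the first newline at or after approx_offset."""
    cut = text.find("\n", approx_offset)
    if cut == -1:
        return text, ""
    return text[:cut + 1], text[cut + 1:]


# Split roughly in the middle, then parse the two pieces independently.
first_half, second_half = split_at_newline(jsonl_text, len(jsonl_text) // 2)
records = parse_jsonl(first_half) + parse_jsonl(second_half)
```

Parsing the two pieces separately gives the same records as parsing the
whole text, which is what lets a reader hand different parts of one file to
different workers.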


On Sun, 16 Oct 2016 at 16:24 WangJianfei 
wrote:

> Thank you very much! I will have a look at your link.


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread WangJianfei
Thank you very much! I will have a look at your link.







Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-15 Thread Hyukjin Kwon
Hi,


The reason is simply that the JSON data source depends on Hadoop's
LineRecordReader when we first try to read the files.
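
To illustrate the consequence, here is a small plain-Python simulation. It is
not Spark or Hadoop code, and read_as_line_records is a hypothetical helper.
It treats every line as one record, as a line-oriented reader would, and
tries to parse each record as JSON by itself:

```python
import json


def read_as_line_records(text):
    """Treat each line as one record and parse it as JSON on its own."""
    parsed, failed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            # A fragment of a multi-line object is not valid JSON by itself.
            failed.append(line)
    return parsed, failed


one_per_line = '{"a": 1}\n{"a": 2}\n'
pretty_printed = '{\n  "a": 1\n}\n'

ok_parsed, ok_failed = read_as_line_records(one_per_line)
bad_parsed, bad_failed = read_as_line_records(pretty_printed)
```

With one object per line every record parses. With the pretty-printed input
none of the three lines is valid JSON on its own, so nothing parses.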

There is a workaround for this at the following link:
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/
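
The linked post is not reproduced here. As a rough sketch of the general
whole-file idea only, and assuming the workaround amounts to reading each
file as a single string before parsing, the same thing in plain Python looks
like this. The function name read_whole_file_json and the demo file are
hypothetical:

```python
import json
import os
import tempfile


def read_whole_file_json(path):
    """Read the entire file as one string and parse it as one JSON document.

    Note: the whole file is held in memory at once.
    """
    with open(path, encoding="utf-8") as f:
        doc = json.loads(f.read())
    # A top-level array yields many records, a top-level object yields one.
    return doc if isinstance(doc, list) else [doc]


# Hypothetical demo file containing a pretty-printed JSON array.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "multi_line.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write('[\n  {"a": 1},\n  {"a": 2}\n]\n')
    whole_file_records = read_whole_file_json(path)
```

This handles multi-line JSON, but each file has to fit in memory as a single
string and cannot be split across readers.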

I hope this is helpful.


Thanks!


2016-10-16 11:20 GMT+09:00 WangJianfei :

> Hi devs:
> I have a question about the design of spark.read.json: why is the JSON
> file not a standard JSON file? Can anyone tell me the internal reason?
> Any advice is appreciated.