Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-20 Thread Steve Loughran

> On 19 Oct 2016, at 21:46, Jakob Odersky wrote:
> 
> Another reason I could imagine is that files are often read from HDFS,
> which by default uses line terminators to separate records.
> 
> It is possible to implement your own hdfs delimiter finder, however
> for arbitrary json data, finding that delimiter would require stateful
> parsing of the file and would be difficult to parallelize across a
> cluster.
> 


good point. 

If you are creating your own files holding a list of JSON records, then you
could do your own encoding, say one with a header for each record
('J'+'S'+'O'+'N' + int64 length), and split on that: you don't need to scan a
record to know its length, and you can scan a large document counting its
records simply through a sequence of skip + read(byte[8]) operations.
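
A minimal sketch of that framing, in Python and using only the standard library. The exact layout is an assumption made for illustration (4 marker bytes followed by a big-endian signed 64-bit length), and the names write_records, count_records and read_records are hypothetical, not part of Spark or Hadoop. Note that the marker bytes could also occur inside a payload, so a reader that starts at an arbitrary offset would need more than the marker alone to resynchronise.

```python
import io
import json
import struct

MAGIC = b"JSON"                 # 4-byte record marker
HEADER = struct.Struct(">4sq")  # marker + big-endian int64 length (assumed layout)


def _parse_header(header):
    """Validate a raw header and return the payload length."""
    if len(header) < HEADER.size:
        raise ValueError("truncated header")
    magic, length = HEADER.unpack(header)
    if magic != MAGIC or length < 0:
        raise ValueError("bad record header")
    return length


def write_records(stream, records):
    """Write each record as header + UTF-8 encoded JSON payload."""
    for record in records:
        payload = json.dumps(record).encode("utf-8")
        stream.write(HEADER.pack(MAGIC, len(payload)))
        stream.write(payload)


def count_records(stream):
    """Count records by reading headers and seeking past the payloads."""
    count = 0
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return count
        length = _parse_header(header)
        stream.seek(length, io.SEEK_CUR)  # skip the payload without parsing it
        count += 1


def read_records(stream):
    """Yield decoded records one at a time."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return
        length = _parse_header(header)
        yield json.loads(stream.read(length).decode("utf-8"))
```

Because the length is in the header, count_records never parses any JSON; it only reads 12 bytes per record and seeks over the rest.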

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Jakob Odersky
Another reason I could imagine is that files are often read from HDFS,
which by default uses line terminators to separate records.

It is possible to implement your own hdfs delimiter finder, however
for arbitrary json data, finding that delimiter would require stateful
parsing of the file and would be difficult to parallelize across a
cluster.
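
To illustrate why a newline delimiter is easy to parallelize, here is a minimal Python sketch of a reader that is handed an arbitrary byte range of a one-object-per-line file. The function name read_split and the ownership convention (a line belongs to the split containing its first byte) are assumptions made for this example, not taken from the Spark or Hadoop source. The reader needs no knowledge of what came before its range: it only looks for the next newline.

```python
import io
import json


def read_split(data: bytes, start: int, end: int):
    """Return the records whose first byte lies in [start, end)."""
    stream = io.BytesIO(data)
    if start > 0:
        # Step back one byte and discard through the next newline. If the
        # split starts exactly on a line boundary this discards only "\n".
        stream.seek(start - 1)
        stream.readline()
    records = []
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of data
        if line.strip():
            records.append(json.loads(line))
    return records
```

With a JSON document that spans many lines, the same reader would have no way to tell whether a newline found mid-file sits between records or inside one, which is the stateful-parsing problem described above.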

On Tue, Oct 18, 2016 at 4:40 PM, Hyukjin Kwon wrote:
> Regarding his recent PR[1], I guess he meant multiple line json.
>
> As far as I know, single line json also complies with the standard. I left a
> comment with RFC in the PR but please let me know if I am wrong at any
> point.
>
> Thanks!
>
> [1]https://github.com/apache/spark/pull/15511
>
>
> On 19 Oct 2016 7:00 a.m., "Daniel Barclay" 
> wrote:
>>
>> Koert,
>>
>> Koert Kuipers wrote:
>>
>> A single json object would mean for most parsers it needs to fit in memory
>> when reading or writing
>>
>> Note that codlife didn't seem to be asking about single-object JSON
>> files, but about standard-format JSON files.
>>
>>
>> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>>>
>>> Hi:
>>> I have a question about the design of spark.read.json: why is the JSON
>>> file not a standard JSON file? Who can tell me the internal reason? Any
>>> advice is appreciated.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>




Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR[1], I guess he meant multiple line json.

As far as I know, single line json also complies with the standard. I left a
comment with RFC in the PR but please let me know if I am wrong at any
point.

Thanks!

[1]https://github.com/apache/spark/pull/15511
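
As a small illustration of that point, the Python sketch below turns a multi-line JSON array into one object per line; every output line is still a complete JSON value. The helper name to_json_lines is hypothetical. It relies on json.dumps, which by default emits no literal newlines because newlines inside strings are escaped as \n.

```python
import json


def to_json_lines(document: str) -> str:
    """Convert a JSON array (or a single value) to one JSON value per line."""
    records = json.loads(document)
    if not isinstance(records, list):
        records = [records]
    # json.dumps without indent produces a single line per record.
    return "".join(json.dumps(record) + "\n" for record in records)
```

Note that this loads the whole input document at once, so it is only suitable for files that fit in memory.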

On 19 Oct 2016 7:00 a.m., "Daniel Barclay" 
wrote:

> Koert,
>
> Koert Kuipers wrote:
>
> A single json object would mean for most parsers it needs to fit in memory
> when reading or writing
>
> Note that codlife didn't seem to be asking about *single-object* JSON
> files, but about *standard-format* JSON files.
>
>
> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>
>> Hi:
>> I have a question about the design of spark.read.json: why is the JSON
>> file not a standard JSON file? Who can tell me the internal reason? Any
>> advice is appreciated.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Daniel Barclay

Koert,

Koert Kuipers wrote:


A single json object would mean for most parsers it needs to fit in memory when 
reading or writing


Note that codlife didn't seem to be asking about /single-object/ JSON files,
but about /standard-format/ JSON files.


On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:

Hi:
   I have a question about the design of spark.read.json: why is the JSON file
not a standard JSON file? Who can tell me the internal reason? Any advice is
appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
 

Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread Koert Kuipers
A single json object would mean for most parsers it needs to fit in memory
when reading or writing
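
A minimal Python sketch of the difference, using only the standard library (the names iter_json_lines and sum_field are hypothetical). Python's json.load reads the entire document before returning, whereas with one object per line a reader can hold a single record in memory at a time:

```python
import io
import json


def iter_json_lines(stream):
    """Yield one parsed record per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def sum_field(stream, field):
    """Aggregate over the file without materialising all records."""
    return sum(record[field] for record in iter_json_lines(stream))
```

For example, sum_field(io.StringIO('{"n": 1}\n{"n": 2}\n'), "n") processes the two records one after the other rather than building a list of them first.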

On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:

> Hi:
> I have a question about the design of spark.read.json: why is the JSON file
> not a standard JSON file? Who can tell me the internal reason? Any advice is
> appreciated.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-15 Thread codlife
Hi:
   I have a question about the design of spark.read.json: why is the JSON file
not a standard JSON file? Who can tell me the internal reason? Any advice is
appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp27907.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
