If it’s one 33MB file which decompresses to 1.5GB, then there is also a chance you need to split the inputs, since gzip is a non-splittable compression format.
On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias <zouz...@gmail.com> wrote:

> Are you sure that your JSON file has the right format?
>
> spark.read.json(...) expects a file where *each line is a JSON object*.
>
> My wild guess is that
>
> val hdf = spark.read.json("/user/tmp/hugedatafile")
> hdf.show(2) or hdf.take(1) gives OOM
>
> tries to fetch all the data into the driver. Can you reformat your input
> file and try again?
>
> Best,
> Anastasios
>
> On Tue, Jun 5, 2018 at 8:39 PM, raksja <shanmugkr...@gmail.com> wrote:
>
>> I have a JSON file which is one continuous array of objects of a similar
>> type, [{},{}...], about 1.5GB uncompressed and 33MB gzip-compressed.
>>
>> This hugedatafile is uploaded to HDFS. It is not a JSONL file; it is a
>> whole regular JSON file.
>>
>> [{"id":"1","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
>> {"id":"2","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
>> {"id":"3","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"}..................]
>>
>> I get OOM on the executors whenever I try to load this into Spark.
>>
>> Try 1
>> val hdf = spark.read.json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) gives OOM
>>
>> Try 2
>> Took a small sampledatafile and got its schema, to avoid schema inference:
>> val sampleSchema = spark.read.json("/user/tmp/sampledatafile").schema
>> val hdf = spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) is stuck for 1.5 hrs, then gives OOM
>>
>> Try 3
>> Repartitioned it before performing the action; gives OOM
>>
>> Try 4
>> Read https://issues.apache.org/jira/browse/SPARK-20980 completely.
>> val hdf = spark.read.option("multiLine", true).schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(1) or hdf.take(1) gives OOM
>>
>> Can anyone help me here?
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>

--
Twitter: https://twitter.com/holdenkarau
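To illustrate the per-line format the thread keeps coming back to: a single JSON array must be parsed whole before any record is usable, while a JSONL file can be consumed one line at a time, which is what a line-oriented reader like spark.read.json (without multiLine) relies on. A small stdlib-only sketch (the literals here are made-up stand-ins for the thread's data):

```python
import io
import json

array_text = '[{"id": "1"}, {"id": "2"}, {"id": "3"}]'
jsonl_text = '{"id": "1"}\n{"id": "2"}\n{"id": "3"}\n'

# Array form: nothing is usable until the entire document is parsed.
records_from_array = json.loads(array_text)

# JSONL form: each line stands alone, so records can be parsed and
# distributed independently, one line at a time.
records_from_jsonl = [json.loads(line) for line in io.StringIO(jsonl_text)]

assert records_from_array == records_from_jsonl
```

That independence per line is also what makes the JSONL form splittable across many part files or input splits.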