If it’s one 33MB file which decompresses to 1.5GB, then there is also a chance you need to split the inputs, since gzip is a non-splittable compression format.
On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias <zouz...@gmail.com> wrote:

> Are you sure that your JSON file has the right format?
>
> spark.read.json(...) expects a file where *each line is a JSON object*.
>
> My wild guess is that
>
> val hdf = spark.read.json("/user/tmp/hugedatafile")
> hdf.show(2) or hdf.take(1) gives OOM
>
> tries to fetch all the data into the driver. Can you reformat your input
> file and try again?
>
> Best,
> Anastasios
>
> On Tue, Jun 5, 2018 at 8:39 PM, raksja <shanmugkr...@gmail.com> wrote:
>
>> I have a JSON file which is one continuous array of objects of a similar
>> type, [{},{}...], about 1.5GB uncompressed and 33MB gzip-compressed.
>>
>> This hugedatafile is uploaded to HDFS. It is not a JSONL file; it is a
>> whole regular JSON file.
>>
>> [{"id":"1","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
>> {"id":"2","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"},
>> {"id":"3","entityMetadata":{"lastChange":"2018-05-11 01:09:18.0","createdDateTime":"2018-05-11 01:09:18.0","modifiedDateTime":"2018-05-11 01:09:18.0"},"type":"11"}..................]
>>
>> I get OOM on the executors whenever I try to load this into Spark.
>>
>> Try 1
>> val hdf = spark.read.json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) gives OOM
>>
>> Try 2
>> Took a small sampledatafile and got its schema, to avoid schema inference:
>> val sampleSchema = spark.read.json("/user/tmp/sampledatafile").schema
>> val hdf = spark.read.schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(2) or hdf.take(1) is stuck for 1.5 hrs, then gives OOM
>>
>> Try 3
>> Repartitioned it before performing the action; gives OOM
>>
>> Try 4
>> Read https://issues.apache.org/jira/browse/SPARK-20980 completely.
>> val hdf = spark.read.option("multiLine", true).schema(sampleSchema).json("/user/tmp/hugedatafile")
>> hdf.show(1) or hdf.take(1) gives OOM
>>
>> Can anyone help me here?
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>

--
Twitter: https://twitter.com/holdenkarau
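To illustrate the per-line format the thread keeps coming back to: a single JSON array must be parsed whole before any record is usable, while a JSONL file can be consumed one line at a time, which is what a line-oriented reader like spark.read.json (without multiLine) relies on. A small stdlib-only sketch (the literals here are made-up stand-ins for the thread's data):

```python
import io
import json

array_text = '[{"id": "1"}, {"id": "2"}, {"id": "3"}]'
jsonl_text = '{"id": "1"}\n{"id": "2"}\n{"id": "3"}\n'

# Array form: nothing is usable until the entire document is parsed.
records_from_array = json.loads(array_text)

# JSONL form: each line stands alone, so records can be parsed and
# distributed independently, one line at a time.
records_from_jsonl = [json.loads(line) for line in io.StringIO(jsonl_text)]

assert records_from_array == records_from_jsonl
```

That independence per line is also what makes the JSONL form splittable across many part files or input splits.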