Hi,
Not sure if it is your case, but if the source data is heavy and deeply
nested, I'd recommend explicitly providing the schema when reading the JSON:
df = spark.read.schema(schema).json(updated_dataset)
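For example, a minimal sketch of building such a schema in PySpark (the field
names and types below are made up for illustration; the real ones would come
from your documents, and an existing spark session plus the updated_dataset of
JSON strings from your job are assumed):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical fields; replace them with the ones your JSON actually has.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("payload", StringType(), True),
])

df = spark.read.schema(schema).json(updated_dataset)

With an explicit schema Spark skips the extra pass over the data it would
otherwise need to infer one, which can matter a lot on large, deeply nested
JSON.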
On Thu, 21 Jan 2021 at 04:15, srinivasarao daruna wrote:
Hi,
I am running a Spark job on a huge dataset. I have allocated 10 R5.16xlarge
machines (each with 64 cores and 512 GB of memory).
The source data is JSON and I need to do some JSON transformations, so I
read the files as text and then convert them to a DataFrame.
ds = spark.read.textFile()
updated_dataset =
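For anyone following along, here is a rough sketch of that read-as-text,
transform, then parse flow in PySpark as I read it; the path and the string
transformation are placeholders rather than the actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-transform-sketch").getOrCreate()

# Read the raw files as plain text, one JSON document per line (placeholder path).
ds = spark.read.text("s3://some-bucket/raw-json/")

# Stand-in for the real string-level JSON transformation.
updated_dataset = ds.rdd.map(lambda row: row.value.replace("old_key", "new_key"))

# Convert the transformed JSON strings into a DataFrame; this is where
# adding .schema(schema), as suggested above, avoids schema inference.
df = spark.read.json(updated_dataset)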