Re: Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Michael Armbrust
> Each json file is of a single object and has the potential to have
> variance in the schema.

How much variance are we talking? JSON->Parquet is going to do well with 100s of different columns, but at 10,000s many things will probably start breaking.
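To make the point concrete: when the schemas vary, the merged schema Spark infers is the union of every field seen across all objects, so the column count grows with the variance. A minimal plain-Python sketch of that union step (the data here is hypothetical, just to illustrate the effect):

```python
import json

# Three JSON objects with overlapping but non-identical key sets,
# standing in for three single-object JSON files.
raw_files = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "price": 9.99}',
    '{"id": 3, "name": "c", "tags": ["x"]}',
]

# The merged schema is the union of all keys seen so far --
# each new key set can only grow it, never shrink it.
merged_schema = set()
for raw in raw_files:
    merged_schema |= json.loads(raw).keys()

print(sorted(merged_schema))  # -> ['id', 'name', 'price', 'tags']
```

With only a handful of distinct keys per object this stays cheap; if every file contributes fresh keys, the union (and the resulting Parquet schema) keeps widening, which is where the 10,000s-of-columns breakage comes from.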

Spark SQL / Parquet - Dynamic Schema detection

2016-03-14 Thread Anthony Andras
Hello there,

I am trying to write a program in Spark that loads multiple json files (with undefined schemas) into a dataframe and then writes it out to a parquet file. When doing so, I am running into a number of garbage collection issues because my JVM is running out of heap
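For reference, the pipeline being described would look roughly like the sketch below. This is not the poster's actual code, and it requires a running Spark environment; the paths and app name are placeholders. Supplying an explicit schema (when one can be agreed on) lets Spark skip the inference pass over every file, which is one common way to reduce driver memory pressure:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Schema inference scans the input; with many files of varying shape
# this is where memory cost shows up.
df = spark.read.json("/path/to/json/dir")  # placeholder path

df.write.mode("overwrite").parquet("/path/to/output.parquet")  # placeholder path
```

If the variance is bounded, passing a hand-built `StructType` via `spark.read.schema(...)` instead of relying on inference is worth trying before tuning GC or heap sizes.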