As a follow-up question, what happened to org.apache.spark.sql.parquet.RowWriteSupport? It seems like it would help me.
On Thu, Apr 19, 2018 at 9:23 PM, Christopher Piggott <cpigg...@gmail.com> wrote:
> I am trying to write some parquet files and running out of memory. I'm
> giving my workers each 16GB and the data is 102 columns * 65536 rows - not
> really all that much. The content of each row is a short string.
>
> I am trying to create the file by dynamically allocating a StructType of
> StructField objects. I then tried various methods of building an Array or
> List of Row objects for each of the 65,536 rows. The last attempt was to
> create an ArrayBuffer of the correct length.
>
> In all cases, I run out of memory.
>
> It occurs to me that what I really need is a way to generate and stream
> the parquet files directly to an HDFS file. I have 70,000+ of these files,
> so for starters I'm OK with creating 70,000 parquet files as long as
> there's some way I can merge them later.
>
> Is there an approach for generating parquet files from spark (ultimately
> to HDFS) that lets me put each row out one at a time, in a streaming
> fashion?
>
> BTW I'm using spark 2.2.1 and whatever parquet library was bundled within.
>
> --Chris
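
For context, here is a minimal sketch of the setup described above, assuming the rows can be generated on the executors rather than materialized on the driver (the column names, the makeRow generator, the partition count, and the output path are all placeholders, not anything from the original mail):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()
        val sc = spark.sparkContext

        // 102 short-string columns; column names are placeholders.
        val schema = StructType(
          (0 until 102).map(i => StructField(s"col$i", StringType, nullable = true)))

        // Hypothetical row generator: produce one Row of 102 short strings.
        def makeRow(rowIndex: Int): Row =
          Row.fromSeq((0 until 102).map(c => s"r${rowIndex}c$c"))

        // Build the rows inside an RDD so they are created on the executors,
        // partition by partition, instead of all at once in driver memory.
        val rowsRdd = sc.parallelize(0 until 65536, numSlices = 8).map(makeRow)
        val df = spark.createDataFrame(rowsRdd, schema)

        // Each task streams its partition out as a Parquet part-file;
        // no single process holds all 65,536 rows. Output path is an assumption.
        df.write.parquet("hdfs:///tmp/example-output")

        spark.stop()
      }
    }

This is only a sketch of one way to avoid driver-side buffering; it does not address merging the 70,000 resulting files.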