As a follow-up question, what happened to org.apache.spark.sql.parquet.RowWriteSupport? It seems like it would help me.
On Thu, Apr 19, 2018 at 9:23 PM, Christopher Piggott <cpigg...@gmail.com> wrote:
> I am trying to write some parquet files and running out of memory. I'm
> giving my workers each 16GB and the data is 102 columns * 65536 rows - not
> really all that much. The content of each row is a short string.
>
> I am trying to create the file by dynamically allocating a StructType of
> StructField objects. I then tried various methods of building an Array or
> List of Row objects for each of the 65,536 rows. The last attempt was to
> create an ArrayBuffer of the correct length.
>
> In all cases, I run out of memory.
>
> It occurs to me that what I really need is a way to generate and stream
> the parquet files directly to an HDFS file. I have 70,000+ of these files,
> so for starters I'm OK with creating 70,000 parquet files as long as
> there's some way I can merge them later.
>
> Is there an approach for generating parquet files from spark (ultimately
> to HDFS) that lets me put each row out one at a time, in a streaming
> fashion?
>
> BTW I'm using spark 2.2.1 and whatever parquet library was bundled within.
>
> --Chris
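
For context, here is a minimal sketch of the setup described above, assuming the rows can be generated on the executors rather than materialized on the driver (the column names, the makeRow generator, the partition count, and the output path are all placeholders, not anything from the original mail):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parquet-sketch").getOrCreate()
        val sc = spark.sparkContext

        // 102 short-string columns; column names are placeholders.
        val schema = StructType(
          (0 until 102).map(i => StructField(s"col$i", StringType, nullable = true)))

        // Hypothetical row generator: produce one Row of 102 short strings.
        def makeRow(rowIndex: Int): Row =
          Row.fromSeq((0 until 102).map(c => s"r${rowIndex}c$c"))

        // Build the rows inside an RDD so they are created on the executors,
        // partition by partition, instead of all at once in driver memory.
        val rowsRdd = sc.parallelize(0 until 65536, numSlices = 8).map(makeRow)
        val df = spark.createDataFrame(rowsRdd, schema)

        // Each task streams its partition out as a Parquet part-file;
        // no single process holds all 65,536 rows. Output path is an assumption.
        df.write.parquet("hdfs:///tmp/example-output")

        spark.stop()
      }
    }

This is only a sketch of one way to avoid driver-side buffering; it does not address merging the 70,000 resulting files.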