Re: Multiple DataFrames per Parquet file?

ayan guha Sun, 10 May 2015 08:46:28 -0700

Hi

In that case, read the entire folder as a single RDD and give it a reasonable
number of partitions.
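
A minimal sketch of that idea, under some assumptions: it uses the DataFrame
CSV reader from newer Spark releases (SparkSession API) rather than a raw RDD,
it assumes the CSV files have a header row, and the 128 MiB target size per
output file is only a common rule of thumb. The names choose_num_partitions,
consolidate_csv_folder, and source_file are made up for illustration.

```python
import math


def choose_num_partitions(total_bytes, target_bytes=128 * 1024 * 1024):
    """Pick a partition count so each output file is roughly target_bytes.

    The 128 MiB default is an assumption (a common HDFS block size),
    not something Spark requires.
    """
    return max(1, math.ceil(total_bytes / target_bytes))


def consolidate_csv_folder(spark, csv_dir, parquet_dir, num_partitions):
    """Read every CSV in csv_dir as one DataFrame and write a few Parquet files.

    `spark` is an existing SparkSession. The function is hypothetical; only
    the Spark calls inside it are real API.
    """
    # Imported here so choose_num_partitions can be used without Spark installed.
    from pyspark.sql.functions import input_file_name

    df = (
        spark.read
        .option("header", "true")  # assumption: each CSV has a header row
        .csv(csv_dir)              # a directory path reads all files in it
        # Record which file each row came from, so the original
        # per-file tables can still be told apart later.
        .withColumn("source_file", input_file_name())
    )

    # repartition() controls how many Parquet part files get written.
    df.repartition(num_partitions).write.mode("overwrite").parquet(parquet_dir)
```

Reading the result back gives one DataFrame for the whole folder. Getting a
single original table out of it still means filtering on the source_file
column, so this trades separate DataFrames for fewer, larger files.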


Best
Ayan
On 11 May 2015 01:35, "Peter Aberline" <peter.aberl...@gmail.com> wrote:

> Hi
>
> Thanks for the quick response.
>
> No, I'm not using Streaming. Each DataFrame represents tabular data read
> from a CSV file. They all have the same schema.
>
> There is also the option of appending each DF to the Parquet file, but
> then I can't maintain them as separate DFs when reading back in without
> filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
>
>
> On 10 May 2015 at 15:51, ayan guha <guha.a...@gmail.com> wrote:
>
>> How did you end up with thousands of DataFrames? Are you using streaming? In
>> that case you can use foreachRDD, keep merging the incoming RDDs into a
>> single RDD, and then save it through your own checkpoint mechanism.
>>
>> If not, please share your use case.
>> On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I have many thousands of small DataFrames that I would like to save to
>>> a single Parquet file to avoid the HDFS 'small files' problem. My
>>> understanding is that there is a 1:1 relationship between DataFrames and
>>> Parquet files if a single partition is used.
>>>
>>> Is it possible to have multiple DataFrames within a single Parquet file
>>> using PySpark?
>>> Or is the only way to achieve this to union the DataFrames into one?
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>>
>
