Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Alex Landa
Hi,

It sounds similar to what we do in our application.
We don't serialize every row, but instead we group first the rows into the
wanted representation and then apply protobuf serialization using map and
lambda.
I suggest not to serialize the entire DataFrame into a single protobuf
message since it may cause OOM errors.

Thanks,
Alex

On Mon, Aug 19, 2019 at 11:24 PM Rishikesh Gawade 
wrote:

> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far, I
> have been able to serialize every row of the dataframe by using map
> function and the logic for serialization within the same(within the lambda
> function). The resultant dataframe consists of rows in serialized format(1
> row = 1 serialized message).
> I wish to form a single protobuf serialized message for this dataframe and
> in order to do that i need to combine all the serialized rows using some
> custom logic very similar to the one used in map operation.
> I am assuming that this would be possible by using the reduce operation on
> the dataframe, however, i am unaware of how to go about it.
> Any suggestions/approach would be much appreciated.
>
> Thanks,
> Rishikesh
>


Re: How to combine all rows into a single row in DataFrame

2019-08-19 Thread Marcin Tustin
It sounds like you want to aggregate your rows in some way. I actually just
wrote a blog post about that topic:
https://medium.com/@albamus/spark-aggregating-your-data-the-fast-way-e37b53314fad

On Mon, Aug 19, 2019 at 4:24 PM Rishikesh Gawade 
wrote:

> *This Message originated outside your organization.*
> --
> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far, I
> have been able to serialize every row of the dataframe by using map
> function and the logic for serialization within the same(within the lambda
> function). The resultant dataframe consists of rows in serialized format(1
> row = 1 serialized message).
> I wish to form a single protobuf serialized message for this dataframe and
> in order to do that i need to combine all the serialized rows using some
> custom logic very similar to the one used in map operation.
> I am assuming that this would be possible by using the reduce operation on
> the dataframe, however, i am unaware of how to go about it.
> Any suggestions/approach would be much appreciated.
>
> Thanks,
> Rishikesh
>


How to combine all rows into a single row in DataFrame

2019-08-19 Thread Rishikesh Gawade
Hi All,
I have been trying to serialize a dataframe in protobuf format. So far, I
have been able to serialize every row of the dataframe by using map
function and the logic for serialization within the same(within the lambda
function). The resultant dataframe consists of rows in serialized format(1
row = 1 serialized message).
I wish to form a single protobuf serialized message for this dataframe and
in order to do that i need to combine all the serialized rows using some
custom logic very similar to the one used in map operation.
I am assuming that this would be possible by using the reduce operation on
the dataframe, however, i am unaware of how to go about it.
Any suggestions/approach would be much appreciated.

Thanks,
Rishikesh