I am not aware of any built-in transform that can do this. However, it
should not be that difficult to do with a group-by-key.

Suppose one reads the CSV file into a PCollection of dictionaries of the
format {'original_column_1': value1, 'original_column_2': value2, ...}.
Suppose further that original_column_N is the index column (its values
will become the new column names). To compute the transpose, you can use the
following PTransform:

import apache_beam as beam

class Transpose(beam.PTransform):
    def __init__(self, index_column):
        super().__init__()
        self._index_column = index_column

    def expand(self, pcoll):
        return (pcoll
            # Map to tuples of the form (column_name, (index, value))
            | beam.FlatMap(lambda original_row, ix_col: [
                (col, (original_row[ix_col], value))
                for col, value in original_row.items()
                if col != ix_col], self._index_column)
            # Group all values for a column together.
            | beam.GroupByKey()
            # Map to dictionaries of the form {index: value, ...} plus the
            # original column name. Python 3 lambdas cannot unpack tuples,
            # so the (col, values) pair is indexed instead.
            | beam.Map(lambda kv: dict(kv[1], original_column_name=kv[0])))

You can then apply this to your PCollection by writing

transposed_pcoll = pcoll | Transpose('original_column_N')
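
To make the expected input and output shapes concrete, here is a rough
plain-Python sketch of the same three steps (emit pairs, group by column,
build one dict per column), run on a tiny in-memory CSV. It does not use Beam
at all. The sample data, the 'gene' index column, and the helper names
read_rows and transpose are made up for illustration, and csv.DictReader is
only one possible way to get the row dictionaries assumed above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; 'gene' plays the role of original_column_N.
CSV_TEXT = """gene,sample_a,sample_b
g1,1,2
g2,3,4
"""

def read_rows(text):
    """Parse CSV text into one dict per row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))

def transpose(rows, index_column):
    """Plain-Python stand-in for the FlatMap / GroupByKey / Map steps."""
    grouped = defaultdict(list)
    for row in rows:
        for col, value in row.items():
            if col != index_column:
                # Same (column_name, (index, value)) pairs the FlatMap emits.
                grouped[col].append((row[index_column], value))
    # One output dict per original column, as in the final Map.
    return [dict(values, original_column_name=col)
            for col, values in grouped.items()]

transposed = transpose(read_rows(CSV_TEXT), 'gene')
# [{'g1': '1', 'g2': '3', 'original_column_name': 'sample_a'},
#  {'g1': '2', 'g2': '4', 'original_column_name': 'sample_b'}]
```

Note that the values stay strings here because csv.DictReader does not
convert types. Each output dict holds every value of one original column, so
this layout assumes a single column's values fit in memory on one worker.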


On Sun, Jan 13, 2019 at 5:19 PM Sameer Abhyankar <[email protected]>
wrote:

> Hi Eila - While I am not aware of a transpose transform available for CSV
> files, there is a sample pipeline available to transpose a BigQuery table
> and write the results to a different table[1]. It might be possible to
> modify this to work on a CSV source.
>
> [1]
> https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-bigquery-transpose
>
>
> On Sun, Jan 13, 2019 at 1:58 AM OrielResearch Eila Arich-Landkof <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I am working with many CSV files where the common part is the row names,
>> and therefore my processing should be by columns. My plan is to have the
>> tables transposed and have the combined tables written into BQ.
>> So, the code should perform:
>> 1. Transpose the tables (columns -> new_rows, rows -> new_columns).
>> new_rows x new_columns = new_table
>> 2. Extract the new_rows values from the new_tables and write them to
>> BigQuery.
>>
>> Is there an easy way to transpose the CSV files? I am avoiding the pandas
>> library because the tables could be very large.
>> Should I be concerned about the table size? Is this consideration relevant,
>> or should Apache Beam be able to handle the resources for pandas?
>>
>> What is my other option? Is there any built-in transpose method that I am
>> not aware of?
>>
>> Thanks for your help,
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>
