Hello,
I would like to write a large number of CSV files to BQ, where the headers from
all of them are aggregated into one common header. Any advice is much
appreciated.

The details are:
1. 2.5M CSV files
2. Each CSV file: a header of 50-60 columns
3. Each CSV file: one data row

There are common columns between the CSV files, but I don't know them in
advance. I would like to have all the CSV files in one BigQuery table.

My current method:
When the number of files was smaller, I read the CSV files and appended them
to one pandas DataFrame, which was written to a file (total.csv). total.csv
was the input to the Beam pipeline.

small CSVs => Pandas DF => total CSV => pCollection => Big Query
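
In code, that aggregation step looks roughly like this (a minimal sketch; the glob pattern and local file layout are my assumptions):

    # Minimal sketch of the current aggregation step (glob pattern is an assumption).
    import glob

    import pandas as pd

    # Each file contributes one row; concat aligns columns by name and fills
    # missing ones with NaN, which effectively gives the union of all headers.
    frames = [pd.read_csv(path) for path in glob.glob("csvs/*.csv")]
    total = pd.concat(frames, ignore_index=True, sort=False)
    total.to_csv("total.csv", index=False)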

The challenge with that approach is that pandas requires a large amount of
memory to hold all 2.5M CSV files before writing them to BQ.

Is there a different way to pipe the CSVs to BQ? One option would be to
split the CSVs into batches and write them to different BQ tables, or append
them all to one table, for example with something like the sketch below.
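
A rough sketch of what I mean: read each single-row CSV inside the Beam pipeline and write straight to BigQuery, skipping total.csv. The bucket path, table name, and the idea of computing the union schema in a cheap first pass are my assumptions, not working code I have run at this scale:

    import csv

    import apache_beam as beam
    from apache_beam.io import fileio

    # Union of all headers, computed once up front by scanning just the
    # header lines (placeholder columns shown here).
    union_schema = "colA:STRING,colB:STRING,colC:STRING"

    def parse_single_row_csv(readable_file):
        # readable_file is a fileio.ReadableFile: one header line + one data row.
        lines = readable_file.read_utf8().splitlines()
        return next(csv.DictReader(lines))  # dict keyed by that file's own columns

    with beam.Pipeline() as p:
        (
            p
            | "MatchFiles" >> fileio.MatchFiles("gs://my-bucket/csvs/*.csv")
            | "ReadMatches" >> fileio.ReadMatches()
            | "ParseCsv" >> beam.Map(parse_single_row_csv)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.all_csvs",
                schema=union_schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )

The schema would still have to cover the union of all headers, so one pass over just the header lines would be needed up front; rows from CSVs that lack a given column would simply get NULL for it.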

Any thoughts on how to do this without a lot of extra coding?

Many thanks,
-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
