Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-28 Thread Ziyad Muhammed
Hi Eila,

I'm not sure if I understand the complexity of your problem.
If you do not have to perform any transformations on the data inside the
CSVs and just need to load them into BigQuery, isn't it enough to use bq
load with schema auto-detection?
https://cloud.google.com/bigquery/docs/schema-detect
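For example, something along these lines might already do the job. This is
an untested sketch using the Python BigQuery client; the bucket path and
table name (gs://my-bucket/csvs/, my_dataset.samples) are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True                      # let BQ infer the schema
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    load_job = client.load_table_from_uri(
        "gs://my-bucket/csvs/*.csv",                  # wildcard over the staged files
        "my_dataset.samples",
        job_config=job_config)
    load_job.result()                                 # wait for the load to complete

The equivalent CLI call would be something like:
bq load --autodetect --source_format=CSV my_dataset.samples "gs://my-bucket/csvs/*.csv"
One caveat: a single load job applies one detected schema, so if the files
really have different columns they may still need to be aligned first.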

Best
Ziyad


On Thu, Sep 27, 2018 at 9:35 PM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Thank you!
> Probably around 50.
>
> Best,
> Eila
>
> On Thu, Sep 27, 2018 at 1:23 AM Ankur Goenka  wrote:
>
>> Hi Eila,
>>
>> That seems reasonable to me.
>>
>> Here is a reference on writing to BQ
>> https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
>>
>> May I know how many distinct columns you are expecting across all files?
>>
>>
>> On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> wrote:
>>
>>> Hi Ankur / users,
>>>
>>> I would like to make sure that the suggested pipeline can work for my
>>> needs.
>>>
>>> So, additional clarification:
>>> - The CSV files have a few common and a few different columns. Each CSV
>>> file represents one sample's measurement record.
>>>
>>> - When the CSVs are merged together, I expect to have one table with the
>>> combined columns from all samples. Will your suggested pipeline allow that?
>>>
>>> - If I understand the following pipeline correctly:
>>>
>>> *Read files => Parse lines => Generate pCollections for each column =>
>>> GroupBy column name => Write to BQ*
>>>
>>> *Read files:* will generate a pCollection from each CSV line (=file)
>>> *Parse Lines*: will extract the column name and its matching value
>>> (generate a tuple)
>>> *Generate pCollection for each column:* will generate a pCollection
>>> from the tuple
>>> *GroupBy:* will merge each column name with all the relevant samples'
>>> values (does it need to know the column names to group by, or will it
>>> automatically use the tuple key?)
>>> *WriteToBQ*: will NOT (is that true?) be able to write the values
>>> matched to the relevant sample. Samples that didn't have a value for a
>>> specific key will get other samples' values that are available in that
>>> column's collection record.
>>>
>>> - If the above understanding is correct, would extracting & merging the
>>> headers from all 2.5M CSV files in advance to derive the schema, and then
>>> using an additional pipeline to populate the data into BQ with that
>>> schema, be the "right" approach?
>>>
>>> Please let me know if I'm missing something here / what your thoughts are.
>>> Many thanks,
>>> Eila
>>>
>>> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
>>> e...@orielresearch.org> wrote:
>>>
 Hi Ankur,

 Thank you. Trying this approach now. Will let you know if I have any
 issue implementing it.
 Best,
 Eila

 On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka  wrote:

> Hi Eila,
>
> If I understand correctly, the objective is to read a large number of
> CSV files, each of which contains a single row with multiple columns,
> deduplicate the columns across the files, and write them to BQ.
> You are using a pandas DF to deduplicate the columns for a small set of
> files, which might not work for a large number of files.
>
> You can use Beam's GroupByKey to deduplicate the columns and write them to
> BigQuery. Beam is capable of reading and managing a large number of files
> when given a path to the directory containing those files.
> So the approach would be ->
> Read files => Parse lines => Generate pCollections for each column =>
> GroupBy column name => Write to BQ
> For reference, here is an example of reading files and doing a group-by:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>
> Note: I am not very familiar with BQ, so I can't think of any more direct
> approach to dump data into BQ.
>
> Thanks,
> Ankur
>
>
> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hello,
>> I would like to write a large number of CSV files to BQ, where the
>> headers from all of them are aggregated into one common header. Any
>> advice is very much appreciated.
>>
>> The details are:
>> 1. 2.5M CSV files
>> 2. Each CSV file: header of 50-60 columns
>> 3. Each CSV file: one data row
>>
>> There are common columns between the CSV files, but I don't know them in
>> advance. I would like to have all the CSV files in one BigQuery table.
>>
>> My current method:
>> When the number of files was smaller, I read the CSV files and
>> appended them to one pandas dataframe that was written to a file
>> (total.csv). total.csv was the input to the Beam pipeline.
>>
>> small CSVs => Pandas DF => total CSV => pCollection => Big Query
>>
>> The challenge with that 

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-27 Thread OrielResearch Eila Arich-Landkof
Thank you!
Probably around 50.

Best,
Eila

On Thu, Sep 27, 2018 at 1:23 AM Ankur Goenka  wrote:

> Hi Eila,
>
> That seems reasonable to me.
>
> Here is a reference on writing to BQ
> https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
>
> May I know how many distinct columns you are expecting across all files?
>
>
> On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hi Ankur / users,
>>
>> I would like to make sure that the suggested pipeline can work for my
>> needs.
>>
>> So, additional clarification:
>> - The CSV files have a few common and a few different columns. Each CSV
>> file represents one sample's measurement record.
>>
>> - When the CSVs are merged together, I expect to have one table with the
>> combined columns from all samples. Will your suggested pipeline allow that?
>>
>> - If I understand the following pipeline correctly:
>>
>> *Read files => Parse lines => Generate pCollections for each column =>
>> GroupBy column name => Write to BQ*
>>
>> *Read files:* will generate a pCollection from each CSV line (=file)
>> *Parse Lines*: will extract the column name and its matching value
>> (generate a tuple)
>> *Generate pCollection for each column:* will generate a pCollection from
>> the tuple
>> *GroupBy:* will merge each column name with all the relevant samples'
>> values (does it need to know the column names to group by, or will it
>> automatically use the tuple key?)
>> *WriteToBQ*: will NOT (is that true?) be able to write the values
>> matched to the relevant sample. Samples that didn't have a value for a
>> specific key will get other samples' values that are available in that
>> column's collection record.
>>
>> - If the above understanding is correct, would extracting & merging the
>> headers from all 2.5M CSV files in advance to derive the schema, and then
>> using an additional pipeline to populate the data into BQ with that
>> schema, be the "right" approach?
>>
>> Please let me know if I'm missing something here / what your thoughts are.
>> Many thanks,
>> Eila
>>
>> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thank you. Trying this approach now. Will let you know if I have any
>>> issue implementing it.
>>> Best,
>>> Eila
>>>
>>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka  wrote:
>>>
 Hi Eila,

 If I understand correctly, the objective is to read a large number of
 CSV files, each of which contains a single row with multiple columns,
 deduplicate the columns across the files, and write them to BQ.
 You are using a pandas DF to deduplicate the columns for a small set of
 files, which might not work for a large number of files.

 You can use Beam's GroupByKey to deduplicate the columns and write them to
 BigQuery. Beam is capable of reading and managing a large number of files
 when given a path to the directory containing those files.
 So the approach would be ->
 Read files => Parse lines => Generate pCollections for each column =>
 GroupBy column name => Write to BQ
 For reference, here is an example of reading files and doing a group-by:
 https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py

 Note: I am not very familiar with BQ, so I can't think of any more direct
 approach to dump data into BQ.

 Thanks,
 Ankur


 On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
 e...@orielresearch.org> wrote:

> Hello,
> I would like to write a large number of CSV files to BQ, where the headers
> from all of them are aggregated into one common header. Any advice is very
> much appreciated.
>
> The details are:
> 1. 2.5M CSV files
> 2. Each CSV file: header of 50-60 columns
> 3. Each CSV file: one data row
>
> There are common columns between the CSV files, but I don't know them in
> advance. I would like to have all the CSV files in one BigQuery table.
>
> My current method:
> When the number of files was smaller, I read the CSV files and appended
> them to one pandas dataframe that was written to a file (total.csv).
> total.csv was the input to the Beam pipeline.
>
> small CSVs => Pandas DF => total CSV => pCollection => Big Query
>
> The challenge with that approach is that pandas will require a large
> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>
> Is there a different way to pipe the CSVs to BQ? One option would be to
> split the CSVs into batches and write them to different BQ tables, or
> append them to one table.
>
> Any thoughts on how to do it without extra coding?
>
> Many thanks,
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread Ankur Goenka
Hi Eila,

That seems reasonable to me.

Here is a reference on writing to BQ
https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
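In case it helps, here is a rough, untested sketch of what the write step
could look like once the merged column list is known from a first pass over
the headers. The table name, columns, and example rows below are just
placeholders:

    import apache_beam as beam

    # Placeholder: the merged column list would come from the header-merge pass.
    merged_columns = ['sample_id', 'col_a', 'col_b']
    table_schema = ','.join('%s:STRING' % c for c in merged_columns)

    def pad_row(row, columns):
        # Fill in None (NULL in BQ) for columns this sample does not have.
        return {c: row.get(c) for c in columns}

    with beam.Pipeline() as p:
        rows = p | 'CreateRows' >> beam.Create([
            {'sample_id': 's1', 'col_a': '0.5'},
            {'sample_id': 's2', 'col_b': '1.2'},
        ])
        _ = (rows
             | 'PadMissingColumns' >> beam.Map(pad_row, merged_columns)
             | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                 'my-project:my_dataset.samples',
                 schema=table_schema,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

Since the fields are NULLABLE by default, samples that are missing a column
would simply get NULL in that column.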

May I know how many distinct columns you are expecting across all files?


On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hi Ankur / users,
>
> I would like to make sure that the suggested pipeline can work for my
> needs.
>
> So, additional clarification:
> - The CSV files have a few common and a few different columns. Each CSV
> file represents one sample's measurement record.
>
> - When the CSVs are merged together, I expect to have one table with the
> combined columns from all samples. Will your suggested pipeline allow that?
>
> - If I understand the following pipeline correctly:
>
> *Read files => Parse lines => Generate pCollections for each column =>
> GroupBy column name => Write to BQ*
>
> *Read files:* will generate a pCollection from each CSV line (=file)
> *Parse Lines*: will extract the column name and its matching value
> (generate a tuple)
> *Generate pCollection for each column:* will generate a pCollection from
> the tuple
> *GroupBy:* will merge each column name with all the relevant samples'
> values (does it need to know the column names to group by, or will it
> automatically use the tuple key?)
> *WriteToBQ*: will NOT (is that true?) be able to write the values
> matched to the relevant sample. Samples that didn't have a value for a
> specific key will get other samples' values that are available in that
> column's collection record.
>
> - If the above understanding is correct, would extracting & merging the
> headers from all 2.5M CSV files in advance to derive the schema, and then
> using an additional pipeline to populate the data into BQ with that
> schema, be the "right" approach?
>
> Please let me know if I'm missing something here / what your thoughts are.
> Many thanks,
> Eila
>
> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hi Ankur,
>>
>> Thank you. Trying this approach now. Will let you know if I have any
>> issue implementing it.
>> Best,
>> Eila
>>
>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka  wrote:
>>
>>> Hi Eila,
>>>
>>> If I understand correctly, the objective is to read a large number of
>>> CSV files, each of which contains a single row with multiple columns,
>>> deduplicate the columns across the files, and write them to BQ.
>>> You are using a pandas DF to deduplicate the columns for a small set of
>>> files, which might not work for a large number of files.
>>>
>>> You can use Beam's GroupByKey to deduplicate the columns and write them to
>>> BigQuery. Beam is capable of reading and managing a large number of files
>>> when given a path to the directory containing those files.
>>> So the approach would be ->
>>> Read files => Parse lines => Generate pCollections for each column =>
>>> GroupBy column name => Write to BQ
>>> For reference, here is an example of reading files and doing a group-by:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>>>
>>> Note: I am not very familiar with BQ, so I can't think of any more direct
>>> approach to dump data into BQ.
>>>
>>> Thanks,
>>> Ankur
>>>
>>>
>>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
>>> e...@orielresearch.org> wrote:
>>>
 Hello,
 I would like to write a large number of CSV files to BQ, where the headers
 from all of them are aggregated into one common header. Any advice is very
 much appreciated.

 The details are:
 1. 2.5M CSV files
 2. Each CSV file: header of 50-60 columns
 3. Each CSV file: one data row

 There are common columns between the CSV files, but I don't know them in
 advance. I would like to have all the CSV files in one BigQuery table.

 My current method:
 When the number of files was smaller, I read the CSV files and appended
 them to one pandas dataframe that was written to a file (total.csv).
 total.csv was the input to the Beam pipeline.

 small CSVs => Pandas DF => total CSV => pCollection => Big Query

 The challenge with that approach is that pandas will require a large
 amount of memory to hold the 2.5M CSV files before writing them to BQ.

 Is there a different way to pipe the CSVs to BQ? One option would be to
 split the CSVs into batches and write them to different BQ tables, or
 append them to one table.

 Any thoughts on how to do it without extra coding?

 Many thanks,
 --
 Eila
 www.orielresearch.org
 https://www.meetup.com/Deep-Learning-In-Production/



>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread OrielResearch Eila Arich-Landkof
Hi Ankur,

Thank you. Trying this approach now. Will let you know if I have any issue
implementing it.
Best,
Eila

On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka  wrote:

> Hi Eila,
>
> If I understand correctly, the objective is to read a large number of CSV
> files, each of which contains a single row with multiple columns,
> deduplicate the columns across the files, and write them to BQ.
> You are using a pandas DF to deduplicate the columns for a small set of
> files, which might not work for a large number of files.
>
> You can use Beam's GroupByKey to deduplicate the columns and write them to
> BigQuery. Beam is capable of reading and managing a large number of files
> when given a path to the directory containing those files.
> So the approach would be ->
> Read files => Parse lines => Generate pCollections for each column =>
> GroupBy column name => Write to BQ
> For reference, here is an example of reading files and doing a group-by:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>
> Note: I am not very familiar with BQ, so I can't think of any more direct
> approach to dump data into BQ.
>
> Thanks,
> Ankur
>
>
> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hello,
>> I would like to write a large number of CSV files to BQ, where the headers
>> from all of them are aggregated into one common header. Any advice is very
>> much appreciated.
>>
>> The details are:
>> 1. 2.5M CSV files
>> 2. Each CSV file: header of 50-60 columns
>> 3. Each CSV file: one data row
>>
>> There are common columns between the CSV files, but I don't know them in
>> advance. I would like to have all the CSV files in one BigQuery table.
>>
>> My current method:
>> When the number of files was smaller, I read the CSV files and appended
>> them to one pandas dataframe that was written to a file (total.csv).
>> total.csv was the input to the Beam pipeline.
>>
>> small CSVs => Pandas DF => total CSV => pCollection => Big Query
>>
>> The challenge with that approach is that pandas will require a large
>> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>>
>> Is there a different way to pipe the CSVs to BQ? One option would be to
>> split the CSVs into batches and write them to different BQ tables, or
>> append them to one table.
>>
>> Any thoughts on how to do it without extra coding?
>>
>> Many thanks,
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-25 Thread Ankur Goenka
Hi Eila,

If I understand correctly, the objective is to read a large number of CSV
files, each of which contains a single row with multiple columns,
deduplicate the columns across the files, and write them to BQ.
You are using a pandas DF to deduplicate the columns for a small set of
files, which might not work for a large number of files.

You can use Beam's GroupByKey to deduplicate the columns and write them to
BigQuery. Beam is capable of reading and managing a large number of files
when given a path to the directory containing those files.
So the approach would be ->
Read files => Parse lines => Generate pCollections for each column =>
GroupBy column name => Write to BQ
For reference, here is an example of reading files and doing a group-by:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
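And here is a rough, untested sketch of the shape I have in mind for your
case. The GCS path and column names are placeholders, and the final BQ
write is left as a simple print:

    import csv
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def parse_file(path):
        # Each file has a header line and a single data line;
        # emit (column_name, value) pairs for that sample.
        with FileSystems.open(path) as f:
            lines = [line.decode('utf-8') for line in f]
        rows = list(csv.reader(lines))
        header, values = rows[0], rows[1]
        return list(zip(header, values))

    # Expand the file pattern outside the pipeline (placeholder path).
    file_paths = [m.path
                  for m in FileSystems.match(['gs://my-bucket/csvs/*.csv'])[0]
                  .metadata_list]

    with beam.Pipeline() as p:
        _ = (p
             | 'FilePaths' >> beam.Create(file_paths)
             | 'ParseLines' >> beam.FlatMap(parse_file)   # -> (column, value)
             | 'GroupByColumn' >> beam.GroupByKey()       # -> (column, [values, ...])
             | 'Print' >> beam.Map(print))                # stand-in for the BQ write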

Note: I am not very familiar with BQ, so I can't think of any more direct
approach to dump data into BQ.

Thanks,
Ankur


On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hello,
> I would like to write a large number of CSV files to BQ, where the headers
> from all of them are aggregated into one common header. Any advice is very
> much appreciated.
>
> The details are:
> 1. 2.5M CSV files
> 2. Each CSV file: header of 50-60 columns
> 3. Each CSV file: one data row
>
> There are common columns between the CSV files, but I don't know them in
> advance. I would like to have all the CSV files in one BigQuery table.
>
> My current method:
> When the number of files was smaller, I read the CSV files and appended
> them to one pandas dataframe that was written to a file (total.csv).
> total.csv was the input to the Beam pipeline.
>
> small CSVs => Pandas DF => total CSV => pCollection => Big Query
>
> The challenge with that approach is that pandas will require a large
> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>
> Is there a different way to pipe the CSVs to BQ? One option would be to
> split the CSVs into batches and write them to different BQ tables, or
> append them to one table.
>
> Any thoughts on how to do it without extra coding?
>
> Many thanks,
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>


Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-25 Thread OrielResearch Eila Arich-Landkof
Hello,
I would like to write a large number of CSV files to BQ, where the headers
from all of them are aggregated into one common header. Any advice is very
much appreciated.

The details are:
1. 2.5M CSV files
2. Each CSV file: header of 50-60 columns
3. Each CSV file: one data row

There are common columns between the CSV files, but I don't know them in
advance. I would like to have all the CSV files in one BigQuery table.

My current method:
When the number of files was smaller, I read the CSV files and appended
them to one pandas dataframe that was written to a file (total.csv).
total.csv was the input to the Beam pipeline.

small CSVs => Pandas DF => total CSV => pCollection => Big Query

The challenge with that approach is that pandas will require a large
amount of memory to hold the 2.5M CSV files before writing them to BQ.

Is there a different way to pipe the CSVs to BQ? One option would be to
split the CSVs into batches and write them to different BQ tables, or
append them to one table.

Any thoughts on how to do it without extra coding?

Many thanks,
-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/