Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8

On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.mor...@veolia.com>
wrote:

> I do have a problem with my tries to test scaling on dataflow.
>
> My dataflow is pretty simple: I get a list of files from pubsub, so the
> number of files I'm going to use to feed the flow is well known at the
> begining. Here are my steps:
> Let's say I have 200 files containing about 20,000,000 of records
>
>    - *First Step:* Read file contents from storage: files are .tar.gz
>    containing each 4 files (CSV). I return the file content as the whole in a
>    record
>    *OUT:* 200 records (one for each file containing the data of all 4
>    files). Bascillacy it's a dict : {file1: content_of_file1, file2:
>    content_of_file2, etc...}
>
>    - *Second step:*  Joining the data of the 4 files in one record (the
>    main file contains foreign key to get information from the other files)
>    *OUT:* 20,000,000 records each for every line in the files. Each
>    record is a list of string
>
>    - *Third step:* cleaning data (convert to prepare integration in
>    bigquery) and set them as a dict where keys are bigquery column name.
>    *OUT:* 20,000,000 records as dict for each record
>
>    - *Fourth step:* insert into bigquery
>
> So the first step return 200 records, but I have 20,000,000 records to
> insert.
> This takes about 1 hour and half and always use 1 worker ...
>
> If I manually set the number of workers, it's not really faster. So for an
> unknow reason, it doesn't scale, any ideas how to do it?
>
> Thanks for any help.
>
> *Sébastien MORAND*
> Team Lead Solution Architect
> Technology & Operations / Digital Factory
> Veolia - Group Information Systems & Technology (IS&T)
> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> <+33%201%2085%2057%2071%2008>
> Bureau 0144C (Ouest)
> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> *www.veolia.com <http://www.veolia.com>*
> <http://www.veolia.com>
> <https://www.facebook.com/veoliaenvironment/>
> <https://www.youtube.com/user/veoliaenvironnement>
> <https://www.linkedin.com/company/veolia-environnement>
> <https://twitter.com/veolia>
>
>
> ------------------------------------------------------------
> --------------------------------
> This e-mail transmission (message and any attached files) may contain
> information that is proprietary, privileged and/or confidential to Veolia
> Environnement and/or its affiliates and is intended exclusively for the
> person(s) to whom it is addressed. If you are not the intended recipient,
> please notify the sender by return e-mail and delete all copies of this
> e-mail, including all attachments. Unless expressly authorized, any use,
> disclosure, publication, retransmission or dissemination of this e-mail
> and/or of its attachments is strictly prohibited.
>
> Ce message electronique et ses fichiers attaches sont strictement
> confidentiels et peuvent contenir des elements dont Veolia Environnement
> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc
> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce
> message par erreur, merci de le retourner a son emetteur et de le detruire
> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la
> publication, la distribution, ou la reproduction non expressement
> autorisees de ce message et de ses pieces attachees sont interdites.
> ------------------------------------------------------------
> --------------------------------
>

Reply via email to