Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8.

On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.mor...@veolia.com> wrote:
> I have a problem when trying to test scaling on Dataflow.
>
> My dataflow is pretty simple: I get a list of files from Pub/Sub, so the
> number of files I'm going to use to feed the flow is well known at the
> beginning. Let's say I have 200 files containing about 20,000,000
> records. Here are my steps:
>
> - *First step:* Read file contents from storage. The files are .tar.gz
>   archives, each containing 4 files (CSV). I return the whole file
>   content in one record.
>   *OUT:* 200 records (one per archive, containing the data of all 4
>   files). Basically it's a dict: {file1: content_of_file1, file2:
>   content_of_file2, etc...}
>
> - *Second step:* Join the data of the 4 files into one record (the
>   main file contains foreign keys to get information from the other
>   files).
>   *OUT:* 20,000,000 records, one for every line in the files. Each
>   record is a list of strings.
>
> - *Third step:* Clean the data (conversions to prepare integration into
>   BigQuery) and set each record as a dict whose keys are the BigQuery
>   column names.
>   *OUT:* 20,000,000 records, one dict per record.
>
> - *Fourth step:* Insert into BigQuery.
>
> So the first step returns 200 records, but I have 20,000,000 records to
> insert.
> This takes about an hour and a half and always uses 1 worker ...
>
> If I manually set the number of workers, it's not really faster. So for an
> unknown reason, it doesn't scale. Any ideas how to make it scale?
>
> Thanks for any help.
>
> *Sébastien MORAND*
> Team Lead Solution Architect
> Technology & Operations / Digital Factory
> Veolia - Group Information Systems & Technology (IS&T)
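
For reference, steps one to three of the quoted message could look roughly like the plain-Python sketch below. It is not the original poster's code. The function names, the file names, the column names, and the join rule (last column of the main file is a foreign key into column 0 of a single lookup file) are all assumptions made for illustration; the real archives contain 4 CSV files and the real cleaning rules are not given in the thread.

```python
import csv
import io
import tarfile


def read_archive(archive_bytes):
    """Step 1: return {member_name: text_content} for one .tar.gz archive."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                raw = tar.extractfile(member).read()
                contents[member.name] = raw.decode("utf-8")
    return contents


def join_records(contents, main_name, lookup_name):
    """Step 2: yield one list of strings per line of the main file.

    Assumption: the last column of the main file is a foreign key that
    matches column 0 of the lookup file.
    """
    lookup = {
        row[0]: row[1:]
        for row in csv.reader(io.StringIO(contents[lookup_name]))
        if row
    }
    for row in csv.reader(io.StringIO(contents[main_name])):
        if row:
            # Append the looked-up columns; unknown keys add nothing.
            yield row + lookup.get(row[-1], [])


def to_bigquery_row(record, columns):
    """Step 3: map a joined record onto (hypothetical) column names."""
    return dict(zip(columns, (value.strip() for value in record)))
```

In a Beam pipeline these would typically be wrapped in transforms such as beam.Map for steps one and three and beam.FlatMap for step two, since step two is where one input record fans out into many output records.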