I think you need to do something like: Read (200) -> GroupByKey (200) -> UnGroup(200) [Not this can be on 200 different workers] -> combine (20M) -> clean (20M) -> filter (20M) -> insert
On Mon, Jun 5, 2017 at 1:14 PM Morand, Sebastien < sebastien.mor...@veolia.com> wrote: > Yes fusion looks like my problem. A job ID to look at: > 2017-06-05_10_14_25-5856213199384263626. > > The point is in your link: > << > For example, one case in which fusion can limit Dataflow's ability to > optimize worker usage is a "high fan-out" ParDo. In such an operation, you > might have an input collection with relatively few elements, but the ParDo > produces an output with hundreds or thousands of times as many elements, > followed by another ParDo > >> > > This is exactly what I'm doing in the step transform-combine-7d5ad942 in > the above job id. > > As fas as I understand, I should create a GroupByKey after the > transform-combine-7d5ad942 on a unique field and then ungroup the data? > (meaning I add two operations in the pipeline to help the worker? > > Right now: > Read (200) -> combine (20M) -> clean (20M) -> filter (20M) -> insert > > Will become: > Read (200) -> combine (20M) -> GroupByKey (20M) -> ungroup (20M) -> clean > (20M) -> filter (20M) -> insert > > It this the right way? > > > > > *Sébastien MORAND* > Team Lead Solution Architect > Technology & Operations / Digital Factory > Veolia - Group Information Systems & Technology (IS&T) > Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 > <+33%201%2085%2057%2071%2008> > Bureau 0144C (Ouest) > 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France > *www.veolia.com <http://www.veolia.com>* > <http://www.veolia.com> > <https://www.facebook.com/veoliaenvironment/> > <https://www.youtube.com/user/veoliaenvironnement> > <https://www.linkedin.com/company/veolia-environnement> > <https://twitter.com/veolia> > > On 5 June 2017 at 21:42, Eugene Kirpichov <kirpic...@google.com> wrote: > >> Do you have a Dataflow job ID to look at? >> It might be due to fusion >> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion >> >> On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. <prabsma...@gmail.com> wrote: >> >>> Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8 >>> >>> On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.mor...@veolia.com> >>> wrote: >>> >>>> I do have a problem with my tries to test scaling on dataflow. >>>> >>>> My dataflow is pretty simple: I get a list of files from pubsub, so the >>>> number of files I'm going to use to feed the flow is well known at the >>>> begining. Here are my steps: >>>> Let's say I have 200 files containing about 20,000,000 of records >>>> >>>> - *First Step:* Read file contents from storage: files are .tar.gz >>>> containing each 4 files (CSV). I return the file content as the whole >>>> in a >>>> record >>>> *OUT:* 200 records (one for each file containing the data of all 4 >>>> files). Bascillacy it's a dict : {file1: content_of_file1, file2: >>>> content_of_file2, etc...} >>>> >>>> - *Second step:* Joining the data of the 4 files in one record >>>> (the main file contains foreign key to get information from the other >>>> files) >>>> *OUT:* 20,000,000 records each for every line in the files. Each >>>> record is a list of string >>>> >>>> - *Third step:* cleaning data (convert to prepare integration in >>>> bigquery) and set them as a dict where keys are bigquery column name. >>>> *OUT:* 20,000,000 records as dict for each record >>>> >>>> - *Fourth step:* insert into bigquery >>>> >>>> So the first step return 200 records, but I have 20,000,000 records to >>>> insert. >>>> This takes about 1 hour and half and always use 1 worker ... >>>> >>>> If I manually set the number of workers, it's not really faster. So for >>>> an unknow reason, it doesn't scale, any ideas how to do it? >>>> >>>> Thanks for any help. >>>> >>>> *Sébastien MORAND* >>>> Team Lead Solution Architect >>>> Technology & Operations / Digital Factory >>>> Veolia - Group Information Systems & Technology (IS&T) >>>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 >>>> <+33%201%2085%2057%2071%2008> >>>> Bureau 0144C (Ouest) >>>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France >>>> *www.veolia.com <http://www.veolia.com>* >>>> <http://www.veolia.com> >>>> <https://www.facebook.com/veoliaenvironment/> >>>> <https://www.youtube.com/user/veoliaenvironnement> >>>> <https://www.linkedin.com/company/veolia-environnement> >>>> <https://twitter.com/veolia> >>>> >>>> >>>> >>>> -------------------------------------------------------------------------------------------- >>>> This e-mail transmission (message and any attached files) may contain >>>> information that is proprietary, privileged and/or confidential to Veolia >>>> Environnement and/or its affiliates and is intended exclusively for the >>>> person(s) to whom it is addressed. If you are not the intended recipient, >>>> please notify the sender by return e-mail and delete all copies of this >>>> e-mail, including all attachments. Unless expressly authorized, any use, >>>> disclosure, publication, retransmission or dissemination of this e-mail >>>> and/or of its attachments is strictly prohibited. >>>> >>>> Ce message electronique et ses fichiers attaches sont strictement >>>> confidentiels et peuvent contenir des elements dont Veolia Environnement >>>> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc >>>> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce >>>> message par erreur, merci de le retourner a son emetteur et de le detruire >>>> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la >>>> publication, la distribution, ou la reproduction non expressement >>>> autorisees de ce message et de ses pieces attachees sont interdites. >>>> >>>> -------------------------------------------------------------------------------------------- >>>> >>> >>> > > > > -------------------------------------------------------------------------------------------- > This e-mail transmission (message and any attached files) may contain > information that is proprietary, privileged and/or confidential to Veolia > Environnement and/or its affiliates and is intended exclusively for the > person(s) to whom it is addressed. If you are not the intended recipient, > please notify the sender by return e-mail and delete all copies of this > e-mail, including all attachments. Unless expressly authorized, any use, > disclosure, publication, retransmission or dissemination of this e-mail > and/or of its attachments is strictly prohibited. > > Ce message electronique et ses fichiers attaches sont strictement > confidentiels et peuvent contenir des elements dont Veolia Environnement > et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc > destines a l'usage de leurs seuls destinataires. Si vous avez recu ce > message par erreur, merci de le retourner a son emetteur et de le detruire > ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la > publication, la distribution, ou la reproduction non expressement > autorisees de ce message et de ses pieces attachees sont interdites. > > -------------------------------------------------------------------------------------------- >