[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116879#comment-17116879 ] Steve Niemitz commented on BEAM-9383: - {quote}Steve, why would two workers jars be staged at the same location ? {quote} Job A runs with `--dataflowWorkerJar=myjar1.jar`, staging path = `gs://somebucket/`, the jar gets uploaded as `gs://somebucket/dataflow-worker.jar`. Job B runs with `–dataflowWorkerJar=myjar2.jar`, same staging path, the jar is uploaded as gs://somebucket/dataflow-worker.jar and overwrites the one from Job A. Previously, the first jar would have been uploaded as gs://somebucket/myjar1-.jar, and the second as gs://somebucket/myjar2-.jar, so the names wouldn't collide. {quote}Also, I'm not sure why we started picking up new jars from JRE libraries for staging {quote} I think this might actually have been unrelated to this change, even after reverting the commit I'm still seeing some "bonus" jars get staged. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116875#comment-17116875 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Steve, why would two workers jars be staged at the same location ? Also, I'm not sure why we started picking up new jars from JRE libraries for staging. Weren't we staging everything in the CLASSPATH anyways ? > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116185#comment-17116185 ] Steve Niemitz commented on BEAM-9383: - There's actually another much more serious bug here. If I set a dataflow worker jar, it gets uploaded to GCS as simply "dataflow-worker.jar". This means if I have two jobs that have two different worker jars set, they'll overwrite each other and the last one will win. The old behavior was correct, the name used was the name of the jar + the SHA256, just like other classpath elements that were staged. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116181#comment-17116181 ] Steve Niemitz commented on BEAM-9383: - I was just validating the 2.22 branch and this seems to have caused some pretty large regressions. As mentioned above, it attempted to stage a bunch of my JRE libraries themselves, but it also uploaded duplicate artifacts in a couple cases. It seems like it no longer deterministically names the artifacts, so that every time I launch a pipeline it'll re-upload the jar? > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113468#comment-17113468 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Removing this from the blockers list. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P2 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112909#comment-17112909 ] Heejong Lee commented on BEAM-9383: --- [https://github.com/apache/beam/pull/11771] could mitigate the problem by removing duplicated artifacts from multiple environments. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112678#comment-17112678 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Seems like it's picking up jars from the Java runtime. nashorn.jar ldrdata.jar jfxrt.jar dnsns.jar localedata.jar MRJToolkit.jar beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar Also part of the problem is that beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar is 51MB and takes a long time to stage. We stage two of each of the above since we have both Kafka read and write transforms in the pipeline. Can we somehow exclude jars from the Java runtime here ? [https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L324] > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112649#comment-17112649 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Note that I'm running from Beam HEAD without specifying additional dependencies or an expansion service. Pipeline is here: [https://paste.ofcode.org/32sxtbEGuzqbw4d7PKMiC6V] > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112645#comment-17112645 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - I tried running a Kafka pipeline on Dataflow and I see a lot of jars being staged during pipeline submission. INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/pipeline.pb in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/596ab8b3-840a-43ff-accb-8f6815e1a302.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/596ab8b3-840a-43ff-accb-8f6815e1a302.jar in 24 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/6f80255b-453f-4ad8-aa28-7e40fdfeedac.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/6f80255b-453f-4ad8-aa28-7e40fdfeedac.jar in 22 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/2704b169-8874-4163-9f3c-ab8765f3c330.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/2704b169-8874-4163-9f3c-ab8765f3c330.jar in 69 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/40bd912f-ce2f-45a8-9625-019b85c46cc7.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/40bd912f-ce2f-45a8-9625-019b85c46cc7.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/9d1bfb42-518d-4cc7-9a3a-7a8ea792ce6f.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/9d1bfb42-518d-4cc7-9a3a-7a8ea792ce6f.jar in 8 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/7e1e7095-32d6-4ea6-b9a0-aa5e2ffdbb31.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/7e1e7095-32d6-4ea6-b9a0-aa5e2ffdbb31.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/926b735c-552f-4a3a-9e81-f0fe8162ce26.jar... INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/926b735c-552f-4a3a-9e81-f0fe8162ce26.jar in 0 seconds. INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/e9f829ba-eadf-4ae4-98c4-492238cb9998.jar... ... Ideally there should be only one jar, beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.ja Any idea where additional jars are coming from. Also can we use names of jars instread of URLs so that we can easily identify what these are ? cc: [~robertwb] [~lcwik] > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111366#comment-17111366 ] Brian Hulette commented on BEAM-9383: - Can this be closed now that https://github.com/apache/beam/pull/11039 is merged? > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 12h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110523#comment-17110523 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Work after [https://github.com/apache/beam/pull/11039] (updating Dataflow to separate dependencies for multiple environments) is not a blocker. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 9h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment
[ https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110521#comment-17110521 ] Chamikara Madhusanka Jayalath commented on BEAM-9383: - Changing to a blocker to get [https://github.com/apache/beam/pull/11039] into Beam 2.22.0. > Staging Dataflow artifacts from environment > --- > > Key: BEAM-9383 > URL: https://issues.apache.org/jira/browse/BEAM-9383 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: Heejong Lee >Assignee: Heejong Lee >Priority: P0 > Fix For: 2.22.0 > > Time Spent: 9h > Remaining Estimate: 0h > > Staging Dataflow artifacts from environment -- This message was sent by Atlassian Jira (v8.3.4#803005)