[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-26 Thread Steve Niemitz (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116879#comment-17116879
 ] 

Steve Niemitz commented on BEAM-9383:
-

{quote}Steve, why would two worker jars be staged at the same location?
{quote}
Job A runs with `--dataflowWorkerJar=myjar1.jar`, staging path = 
`gs://somebucket/`, the jar gets uploaded as 
`gs://somebucket/dataflow-worker.jar`.

Job B runs with `--dataflowWorkerJar=myjar2.jar` and the same staging path, so its 
jar is uploaded as `gs://somebucket/dataflow-worker.jar` and overwrites the one 
from Job A.  Previously, the first jar would have been uploaded as 
`gs://somebucket/myjar1-.jar` and the second as 
`gs://somebucket/myjar2-.jar`, so the names wouldn't collide.
{quote}Also, I'm not sure why we started picking up new jars from JRE libraries 
for staging
{quote}
I think this might actually be unrelated to this change: even after 
reverting the commit, I'm still seeing some "bonus" jars get staged.

> Staging Dataflow artifacts from environment
> ---
>
> Key: BEAM-9383
> URL: https://issues.apache.org/jira/browse/BEAM-9383
> Project: Beam
>  Issue Type: Sub-task
>  Components: java-fn-execution
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P2
> Fix For: 2.22.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> Staging Dataflow artifacts from environment



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-26 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116875#comment-17116875
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Steve, why would two worker jars be staged at the same location?

Also, I'm not sure why we started picking up new jars from JRE libraries for 
staging. Weren't we staging everything in the CLASSPATH anyway?



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-25 Thread Steve Niemitz (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116185#comment-17116185
 ] 

Steve Niemitz commented on BEAM-9383:
-

There's actually another, much more serious bug here.  If I set a Dataflow 
worker jar, it gets uploaded to GCS as simply "dataflow-worker.jar".  This 
means that if two jobs set different worker jars, they'll overwrite each 
other and the last one will win.

The old behavior was correct: the staged name was the name of the jar plus its 
SHA256, just like the other classpath elements that were staged.
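
For illustration, here is a minimal sketch of the "jar name + SHA256" naming scheme described above. It is not Beam's actual staging code; `staged_name`, the `/tmp/...` paths, and the byte strings are all made up for the example.

```python
import hashlib
import os


def staged_name(path: str, content: bytes) -> str:
    """Return '<base>-<sha256 hex><ext>' for a file that is about to be staged.

    Hypothetical helper for illustration only; not taken from Beam.
    """
    base, ext = os.path.splitext(os.path.basename(path))
    digest = hashlib.sha256(content).hexdigest()
    return f"{base}-{digest}{ext}"


# Two different worker jars get two different object names.
name_a = staged_name("/tmp/myjar1.jar", b"contents of job A's worker jar")
name_b = staged_name("/tmp/myjar2.jar", b"contents of job B's worker jar")

# Even an identical file name no longer collides when the contents differ.
same_name_a = staged_name("dataflow-worker.jar", b"build 1")
same_name_b = staged_name("dataflow-worker.jar", b"build 2")
```

With a fixed name such as "dataflow-worker.jar", the last two calls would map to the same object and the later upload would win; with the hash in the name they stay separate.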



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-25 Thread Steve Niemitz (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116181#comment-17116181
 ] 

Steve Niemitz commented on BEAM-9383:
-

I was just validating the 2.22 branch, and this seems to have caused some pretty 
large regressions.  As mentioned above, it attempted to stage a bunch of my JRE 
libraries themselves, but it also uploaded duplicate artifacts in a couple of 
cases.  It seems like it no longer names the artifacts deterministically, so 
every time I launch a pipeline it'll re-upload the jar?
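
A rough sketch of why non-deterministic names would force a re-upload on every launch. It simulates the staging bucket with an in-memory set; `stage` and `bucket` are hypothetical names, and this is not how Beam or the Dataflow runner is implemented.

```python
import hashlib
import uuid


def stage(bucket: set, content: bytes, deterministic: bool) -> bool:
    """Simulate staging one jar; return True if an upload was needed.

    `bucket` stands in for the set of object names already present in GCS.
    """
    if deterministic:
        # Same bytes always produce the same name.
        name = hashlib.sha256(content).hexdigest() + ".jar"
    else:
        # A fresh random name on every call.
        name = str(uuid.uuid4()) + ".jar"
    if name in bucket:
        return False  # already staged, skip the upload
    bucket.add(name)
    return True


jar = b"same jar bytes on every launch"
hashed_bucket, random_bucket = set(), set()

# Launch the "pipeline" three times with each naming scheme.
hashed_uploads = [stage(hashed_bucket, jar, True) for _ in range(3)]
random_uploads = [stage(random_bucket, jar, False) for _ in range(3)]
```

With content-derived names only the first launch uploads; with random names the existence check can never hit, so the same jar is uploaded each time.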



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-21 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113468#comment-17113468
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Removing this from the blockers list.



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-21 Thread Heejong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112909#comment-17112909
 ] 

Heejong Lee commented on BEAM-9383:
---

[https://github.com/apache/beam/pull/11771] could mitigate the problem by 
removing duplicated artifacts from multiple environments.
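
As a sketch of the general idea only (this is not the code in that PR), artifacts shared by several environments could be keyed by content hash so each distinct file is uploaded once. `dedupe_artifacts`, the environment ids, and the byte strings below are hypothetical.

```python
import hashlib


def dedupe_artifacts(environments: dict) -> tuple:
    """Collapse identical artifacts that appear in more than one environment.

    `environments` maps an environment id to a list of artifact contents.
    Returns (uploads, per_env): `uploads` maps digest -> content to upload
    once, and `per_env` maps each environment id to its list of digests.
    """
    uploads = {}
    per_env = {}
    for env_id, artifacts in environments.items():
        digests = []
        for content in artifacts:
            digest = hashlib.sha256(content).hexdigest()
            uploads.setdefault(digest, content)  # keep the first copy only
            digests.append(digest)
        per_env[env_id] = digests
    return uploads, per_env


# Two environments (e.g. a read and a write transform) with the same jars.
envs = {
    "kafka_read": [b"expansion-service jar bytes", b"other jar bytes"],
    "kafka_write": [b"expansion-service jar bytes", b"other jar bytes"],
}
uploads, per_env = dedupe_artifacts(envs)
```

Here four artifact entries collapse to two uploads, and both environments reference the same two digests.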

> Staging Dataflow artifacts from environment
> ---
>
> Key: BEAM-9383
> URL: https://issues.apache.org/jira/browse/BEAM-9383
> Project: Beam
>  Issue Type: Sub-task
>  Components: java-fn-execution
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P0
> Fix For: 2.22.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> Staging Dataflow artifacts from environment





[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112678#comment-17112678
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Seems like it's picking up jars from the Java runtime.

nashorn.jar
cldrdata.jar
jfxrt.jar
dnsns.jar
localedata.jar
MRJToolkit.jar
beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar

 

Part of the problem is also that 
beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar is 51MB and takes a 
long time to stage.

We stage two copies of each of the above since we have both Kafka read and 
write transforms in the pipeline.

Can we somehow exclude jars from the Java runtime here?

[https://github.com/apache/beam/blob/master/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionService.java#L324]
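
One possible way to do that, sketched in Python rather than in the Java of ExpansionService: drop every classpath entry that lives under the Java home directory. `exclude_runtime_jars` and the example paths are hypothetical, and nothing here is taken from Beam.

```python
from pathlib import PurePosixPath


def exclude_runtime_jars(classpath: list, java_home: str) -> list:
    """Return the classpath entries that are not located under java_home."""
    home = PurePosixPath(java_home)
    # An entry is a runtime jar if java_home is one of its ancestor directories.
    return [p for p in classpath if home not in PurePosixPath(p).parents]


# Assumed layout: a JRE at /opt/jdk8/jre and one application jar elsewhere.
classpath = [
    "/opt/jdk8/jre/lib/ext/nashorn.jar",
    "/opt/jdk8/jre/lib/ext/dnsns.jar",
    "/home/user/beam/beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar",
]
to_stage = exclude_runtime_jars(classpath, "/opt/jdk8/jre")
```

Comparing ancestor directories rather than string prefixes avoids accidentally matching a sibling directory such as /opt/jdk8/jre-extras.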



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112649#comment-17112649
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Note that I'm running from Beam HEAD without specifying additional dependencies 
or an expansion service. Pipeline is here:

[https://paste.ofcode.org/32sxtbEGuzqbw4d7PKMiC6V]



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-20 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112645#comment-17112645
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

I tried running a Kafka pipeline on Dataflow and I see a lot of jars being 
staged during pipeline submission.

INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/596ab8b3-840a-43ff-accb-8f6815e1a302.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/596ab8b3-840a-43ff-accb-8f6815e1a302.jar in 24 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/6f80255b-453f-4ad8-aa28-7e40fdfeedac.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/6f80255b-453f-4ad8-aa28-7e40fdfeedac.jar in 22 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/2704b169-8874-4163-9f3c-ab8765f3c330.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/2704b169-8874-4163-9f3c-ab8765f3c330.jar in 69 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/40bd912f-ce2f-45a8-9625-019b85c46cc7.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/40bd912f-ce2f-45a8-9625-019b85c46cc7.jar in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/9d1bfb42-518d-4cc7-9a3a-7a8ea792ce6f.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/9d1bfb42-518d-4cc7-9a3a-7a8ea792ce6f.jar in 8 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/7e1e7095-32d6-4ea6-b9a0-aa5e2ffdbb31.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/7e1e7095-32d6-4ea6-b9a0-aa5e2ffdbb31.jar in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/926b735c-552f-4a3a-9e81-f0fe8162ce26.jar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/926b735c-552f-4a3a-9e81-f0fe8162ce26.jar in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://clouddfe-chamikara/staging/python-222-wc-chamikara.1590014685.641805/e9f829ba-eadf-4ae4-98c4-492238cb9998.jar...

...

 

 

Ideally there should be only one jar, 
beam-sdks-java-io-expansion-service-2.22.0-SNAPSHOT.jar.

Any idea where the additional jars are coming from? Also, can we use the names 
of the jars instead of URLs so that we can easily identify what these are?

 

cc: [~robertwb] [~lcwik]



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-19 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111366#comment-17111366
 ] 

Brian Hulette commented on BEAM-9383:
-

Can this be closed now that https://github.com/apache/beam/pull/11039 is merged?



[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-18 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110523#comment-17110523
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Work after [https://github.com/apache/beam/pull/11039] (updating Dataflow to 
separate dependencies for multiple environments) is not a blocker.

> Staging Dataflow artifacts from environment
> ---
>
> Key: BEAM-9383
> URL: https://issues.apache.org/jira/browse/BEAM-9383
> Project: Beam
>  Issue Type: Sub-task
>  Components: java-fn-execution
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: P0
> Fix For: 2.22.0
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> Staging Dataflow artifacts from environment





[jira] [Commented] (BEAM-9383) Staging Dataflow artifacts from environment

2020-05-18 Thread Chamikara Madhusanka Jayalath (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110521#comment-17110521
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-9383:
-

Changing to a blocker to get [https://github.com/apache/beam/pull/11039] into 
Beam 2.22.0.
