Hi,
Thank you and congratulations on graduating the Go SDK from experimental
status; it's all very exciting. We're already using it successfully for several
of our internal pipelines, but I've now run into an issue writing data with
the pubsubio.Write transform that I am unable to figure out.
In the following example I expect data published on the input topic to be
forwarded to the output topic verbatim. However, Dataflow does not output
anything on the configured topic.
Examining the metrics, I can see that data is read successfully, but there are
no publish requests on the output topic. As more data is read from the input
topic, throughput keeps growing, along with data freshness. There are no logged
errors or failures that I can find.
The SDK version is v2.34.0, but I am seeing the same issue on current master
(go get: downgraded github.com/apache/beam/sdks/v2 v2.34.0 =>
v2.0.0-20211116020159-9de4bd93fb49).
The example below is a simplified version of my actual pipeline, which also
performs intermediate data processing across multiple stages, but it exhibits
the same problem.
> package main
>
> import (
> 	"context"
> 	"flag"
>
> 	"github.com/apache/beam/sdks/v2/go/pkg/beam"
> 	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/pubsubio"
> 	"github.com/apache/beam/sdks/v2/go/pkg/beam/log"
> 	"github.com/apache/beam/sdks/v2/go/pkg/beam/options/gcpopts"
> 	_ "github.com/apache/beam/sdks/v2/go/pkg/beam/runners/dataflow"
> )
>
> func main() {
> 	var input, output string
> 	flag.StringVar(&input, "input", "", "input Pub/Sub topic")
> 	flag.StringVar(&output, "output", "", "output Pub/Sub topic")
> 	flag.Parse()
> 	beam.Init()
>
> 	ctx := context.Background()
> 	project := gcpopts.GetProject(ctx)
>
> 	pipeline, scope := beam.NewPipelineWithRoot()
>
> 	// Read from the input topic and forward the messages verbatim to
> 	// the output topic.
> 	pubsubio.Write(
> 		scope,
> 		project,
> 		output,
> 		pubsubio.Read(
> 			scope,
> 			project,
> 			input,
> 			nil,
> 		),
> 	)
>
> 	if _, err := beam.Run(ctx, "dataflow", pipeline); err != nil {
> 		log.Exitf(ctx, "Failed to run job: %v", err)
> 	}
> }
Dataflow job 2021-11-16_07_43_39-16986075748673839415 is currently executing
the above pipeline, and I'm just about to see if I can reproduce this with the
Python SDK. No new input is being produced at the moment, but the pipeline
should still have previously published data to process.
Note that I'm aware pubsubio is an external transform handled internally by the
runner and that Dataflow does not officially support the Go SDK yet, but any
insights would be helpful.
The answer in https://stackoverflow.com/a/69665998 proposes implementing a
custom transform that uses the official pubsub package, which seems like a
reasonable short-term workaround if I am unable to find a solution.
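For reference, this is roughly the fallback I have in mind: a DoFn that
publishes each element directly with cloud.google.com/go/pubsub. It's only a
sketch based on that answer; the publishFn/writeToPubSub names and the
bundle-scoped client are my own choices, not anything provided by the SDK.

> package main
>
> import (
> 	"context"
> 	"reflect"
>
> 	"cloud.google.com/go/pubsub"
> 	"github.com/apache/beam/sdks/v2/go/pkg/beam"
> )
>
> func init() {
> 	beam.RegisterType(reflect.TypeOf((*publishFn)(nil)).Elem())
> }
>
> // publishFn publishes each element of a PCollection<[]byte> to a Pub/Sub
> // topic using the official client library instead of pubsubio.Write.
> type publishFn struct {
> 	Project string
> 	Topic   string
>
> 	client *pubsub.Client
> 	topic  *pubsub.Topic
> }
>
> func (f *publishFn) StartBundle(ctx context.Context) error {
> 	client, err := pubsub.NewClient(ctx, f.Project)
> 	if err != nil {
> 		return err
> 	}
> 	f.client = client
> 	f.topic = client.Topic(f.Topic)
> 	return nil
> }
>
> func (f *publishFn) ProcessElement(ctx context.Context, data []byte) error {
> 	// Block until the server acknowledges the message so failures surface
> 	// as element errors instead of being silently dropped.
> 	_, err := f.topic.Publish(ctx, &pubsub.Message{Data: data}).Get(ctx)
> 	return err
> }
>
> func (f *publishFn) FinishBundle(ctx context.Context) error {
> 	f.topic.Stop()
> 	return f.client.Close()
> }
>
> // writeToPubSub takes the same arguments as the pubsubio.Write call above.
> func writeToPubSub(s beam.Scope, project, topic string, col beam.PCollection) {
> 	beam.ParDo0(s.Scope("WriteToPubSub"), &publishFn{Project: project, Topic: topic}, col)
> }

Creating the client per bundle keeps the sketch simple; Setup/Teardown would
presumably be the better place for a long-lived client in a real pipeline.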
Let me know if I can provide any additional information that would be helpful.
Thanks,
--
Hannes