sal: is there something that stops us from using
>>> RowCoder by default here? I assume all forms of "PubsubMessage" can be
>>> encoded with RowCoder, it provides flexibility for future changes, and
>>> PubSub will be able to take advantage of future work to make Row
Bump
Any other risks or drawbacks associated with altering the default coder for
PubsubMessage to be the most inclusive coder with respect to possible
fields?
Thanks,
Evan
On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin wrote:
> Hi folks,
>
> I'd like to solicit feedback on t
Hi folks,
I'd like to solicit feedback on the notion of using
PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
default coder for Pubsub messages instead of the current default of
PubsubMessageWithAttributesCoder.
Not long ago, support for reading and writing Pubsub messages
friendly bump Anyone have thoughts or answers? Thanks!
On Thu, Nov 3, 2022 at 3:07 PM Evan Galpin wrote:
> Hi folks,
>
> Hoping to get some definitive answers with respect to streaming pipeline
> bundle retry semantics on Dataflow. I understand that a bundle containing
>
Hi folks,
Hoping to get some definitive answers with respect to streaming pipeline
bundle retry semantics on Dataflow. I understand that a bundle containing
a "poison pill" (bad data, let's say it causes a null pointer exception
when processing in DoFn) will be retried indefinitely. What I'm
+dev for additional visibility/input
On Mon, Aug 29, 2022 at 11:10 AM Leandro Nahabedian via user <
u...@beam.apache.org> wrote:
> Hi community!
>
> I'm looking for a good course to learn apache beam and I saw this one
>
Congrats John!
On Fri, Jul 29, 2022 at 17:18 Ahmet Altay via dev
wrote:
> Congratulations John!
>
> On Fri, Jul 29, 2022 at 2:03 PM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats John!!
>>
>>
>> Svetak Sundhar
>>
>> Technical Solutions Engineer, Data
>> s
One final note of clarification: the pom file needs to be in the same
directory as the jar
On Fri, Jul 22, 2022 at 11:01 Evan Galpin wrote:
> It's working! Huge thank you to Steve Niemitz who pointed out the need for
> "--experiments=enable_custom_pubsub_sink" to prevent d
HOT")
}
}
}
8. Deploy the user code pipeline including the flag:
--experiments=enable_custom_pubsub_sink
On Thu, Jul 21, 2022 at 4:42 PM Evan Galpin wrote:
> Thanks Tomo, I'll check that out too as a good safeguard! Are you
> familiar with any process to build pre-release
Admittedly this is potentially self-serving, but I feel there could be
mutual benefit.
I have a similar situation where I want to use pre-release version of
beam-sdks-java-io-google-cloud-platform. Though I’ve been having trouble
doing so, a possible alternative solution to using the nightly
t;
> On Thu, Jul 21, 2022 at 3:35 PM Evan Galpin wrote:
>
>> Spoke too soon... still can't seem to get the new behaviour to appear in
>> dataflow, possibly something is being overridden?
>>
>> On Thu, Jul 21, 2022 at 3:15 PM Evan Galpin wrote:
>>
>>> M
Spoke too soon... still can't seem to get the new behaviour to appear in
dataflow, possibly something is being overridden?
On Thu, Jul 21, 2022 at 3:15 PM Evan Galpin wrote:
> Making a shadowJar from "beam-sdks-java-io-google-cloud-platform" looks to
> be work
en
deploying the job to dataflow. Looks ok.
On Thu, Jul 21, 2022 at 3:02 PM Evan Galpin wrote:
> I believe I have the dependencySubstitution working, but it seems as
> though the substitution is removing transitive deps of
> "beam-sdks-java-io-google-cloud-platform", hmm...
&
I believe I have the dependencySubstitution working, but it seems as though
the substitution is removing transitive deps of
"beam-sdks-java-io-google-cloud-platform", hmm...
On Thu, Jul 21, 2022 at 1:15 PM Evan Galpin wrote:
> Hi all,
>
> I'm trying to test a chang
Hi all,
I'm trying to test a change I've made locally, but by validating it on
Dataflow. It works locally, but I want to validate on Dataflow. I've
tried a few different attempts at module substitution in the build.gradle
config file for the pipeline I'm trying to deploy, but I haven't had any
Congrats! Well deserved!
On Wed, Jul 20, 2022 at 15:17 Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:
> Congrats, Steve!
>
> On Wed, Jul 20, 2022, 9:16 AM Austin Bennett
> wrote:
>
>> Great!
>>
>> On Wed, Jul 20, 2022 at 10:11 AM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
th GCP support. It could be something
>>> internal but the coders for ephemeral steps that Dataflow adds are based
>>> upon existing coders within the graph.
>>>
>>> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin wrote:
>>>
>>>> +dev@
>>&g
do an update
> to get the latest version?
>
> Feel free to share the job files with GCP support. It could be something
> internal but the coders for ephemeral steps that Dataflow adds are based
> upon existing coders within the graph.
>
> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin
t; is not present in the job file, maybe an
internal Dataflow thing?), so I'm confident that the coder hasn't actually
changed.
I'm not sure how to proceed in updating the running pipeline, and I'd
really prefer not to drain. Any ideas?
Thanks,
Evan
On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin
It may also be helpful to explore CoGroupByKey as a way of joining data,
though depending on the shape of the data doing so may not fit in mem:
https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/
- Evan
On Wed, Jun 15, 2022 at 3:45 PM Bruno Volpato
wrote:
> Hello,
>
On Wed, Jun 15, 2022 at 1:46 PM Ahmet Altay wrote:
>
>
> On Wed, Jun 15, 2022 at 10:39 AM Evan Galpin wrote:
>
>> RE https://github.com/apache/beam/issues/21690, I've just opened a PR
>> with a fix: https://github.com/apache/beam/pull/21895. If that feels
>> too
RE https://github.com/apache/beam/issues/21690, I've just opened a PR with
a fix: https://github.com/apache/beam/pull/21895. If that feels too
rushed, I've also opened a PR to revert the change that introduced the
regression in the first place: https://github.com/apache/beam/pull/21884.
For
I had this happen to me recently as well. After `git bisecting` led to
confusing results, I ran my tests again via gradlew adding `--rerun-tasks`
to the command. This is an expensive operation, but after I ran that I was
able to test again with expected results. YMMV
Thanks,
Evan
On Tue, Jun
Congrats Ke!
- Evan
On Mon, May 30, 2022 at 4:11 AM Jan Lukavský wrote:
> Congrats Ke!
>
> Jan
> On 5/29/22 04:12, Yi Pan wrote:
>
> Congrats, Ke!
>
> -Yi
>
> On Sat, May 28, 2022 at 6:57 PM Robert Burke wrote:
>
>> Congratulations!
>> Another place that runs the Go SDK ;)
>>
>> On Fri, May
I believe that this thread is entirely related to an another thread[1]
where there is discussion that the correct fix for this issue could be to
enforce that watermark updates would only happen at bundle boundaries. There’s
another related thread[2] citing the same error with ElasticsearchIO.
Robert Bradshaw wrote:
> > On Mon, Mar 28, 2022 at 11:45 AM Jan Lukavský wrote:
> >> On 3/28/22 20:17, Reuven Lax wrote:
> >>
> >> On Mon, Mar 28, 2022 at 11:08 AM Robert Bradshaw
> wrote:
> >>> On Mon, Mar 28, 2022 at 11:04 AM Reuven Lax wrote:
&g
updating watermarks
>> > only at bundle boundaries. This seems perfectly legal for a pure 1:1
>> > mapping DoFn. The issue is that DoFns are allowed to buffer data and
>> > emit them in a later process (or finishBundle). If the watermark has
>> > moved on, that may result in l
ly legal for a pure 1:1
> > mapping DoFn. The issue is that DoFns are allowed to buffer data and
> > emit them in a later process (or finishBundle). If the watermark has
> > moved on, that may result in late data. We don't really have a way for
> > a DoFn to declare *it's* output
kes sense to explicitly deprecate for a while pending removal.
> Logging a warning is a good idea.
>
> On Fri, Mar 25, 2022 at 10:02 AM Evan Galpin
> wrote:
> >
> > I'm +1 on removing Elasticsearch 5.x support which has not been
> supported by elastic.co for 3 years
I'm +1 on removing Elasticsearch 5.x support which has not been supported
by elastic.co for 3 years according to the link provided. It looks like
support for the most recent version of 6.x was only dropped 1 month ago
according to the same data. I'd say given the recency, we should continue
to
Thanks for starting this thread Jan, I'm keen to hear thoughts and
outcomes! I thought I would mention that answers to the questions posed
here will help to unblock a 2.38.0 release blocker[1].
[1] https://issues.apache.org/jira/browse/BEAM-14064
On Thu, Mar 24, 2022 at 5:28 AM Jan Lukavský
ents of a singular bundle are
processed[2]"
> Within the lifetime of a single bundle, you don't need to worry about
> holding the watermark back as the runners do that for you.
[1] https://lists.apache.org/thread/7m1gx1o7br7jfblhz4yl7czwhhgc7ml3
[2] https://lists.apache.org/thread/fnrlw7rkcg4of5
t; and the StatefulBulkIOFn would process the iterable as a batch that is
> sent to ElasticsearchIO.
>
> 1:
> https://github.com/apache/beam/blob/c76a1c8361b83dd85b6edf0eddd5ebe03873e0ce/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reify.java#L220
>
>
>
&
Congrats Moritz!
On Fri, Mar 11, 2022 at 3:05 AM Etienne Chauchot
wrote:
> Congrats Moritz ! Well deserved !
>
> Etienne
> Le 10/03/2022 à 19:44, Sachin Agarwal a écrit :
>
> Congratulations Moritz!
>
> On Thu, Mar 10, 2022 at 10:44 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>>
n me and the rest of the Beam PMC in welcoming a new committer:
>> Evan Galpin
>>
>> Since joining the Beam community Evan has done lots of contributions to
>> IOs mainly Elasticsearch, but also to SDK transforms. He also gave support
>> on the ML and tested releases.
>&g
et around an implementation detail that is
>> sub-optimal within Dataflow's GroupIntoBatches implementation.
>>
>> On Thu, Mar 10, 2022 at 6:26 AM Kenneth Knowles wrote:
>>
>>>
>>>
>>> On Wed, Mar 9, 2022 at 10:02 AM Evan Galpin
>>> wrote:
/github.com/apache/beam/blob/a126adbc6aa73f1e30adfa65a3710f7f69a7ba89/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java#L313-L318
On Wed, Mar 9, 2022 at 10:13 AM Evan Galpin wrote:
> Oops, unfortunate misfire before being done writing
g) also has
a notion of "getFailedWrites" with the same intent as what's in
ElasticsearchIO; I wonder how that's implemented?
I'm keen to follow any further discussion on watermark holds for DoFn's
that implement both @StartBundle and @FinishBundle!
- Evan
On Wed, Mar 9, 2022 at 10:05 AM Evan
rs inline.
> On 3/8/22 22:00, Evan Galpin wrote:
>
> Thanks Jan for confirming that the fix looks alright. I also found a
> PR[1] that appears to be a good case study of the Timer watermark hold
> technique that you previously mentioned so I'll study that a bit for my own
> unde
ps only to future instant (and inside same window!), which has
> little semantic meaning to me. Looks more error prone than useful.
>
> Jan
> On 3/8/22 15:53, Evan Galpin wrote:
>
> Thanks Jan, it's interesting to read about the handling of timestamp in
> cases employing a buffer
, but might not be appropriate here.
>
> Currently, the only way to limit the progress of output watermark is by
> setting a timer with output timestamp that has the timestamp of the
> earliest element in the buffer. There was a thread that was discussing this
> in more details [1
validation code.
[1] https://github.com/apache/beam/pull/16744
On Mon, Mar 7, 2022 at 1:39 PM Evan Galpin wrote:
> x-post from the associated Jira ticket[0]
>
>
> Fortunately/unfortunately this same issue struck me as well, and I opened
> a PR[1] to use `ProcessContext#output`
x-post from the associated Jira ticket[0]
Fortunately/unfortunately this same issue struck me as well, and I opened a
PR[1] to use `ProcessContext#output` rather than
`ProcessContext#outputWithTimestamp`. I believe that should resolve this
issue, it has for me when running jobs with a vendored
Hey folks,
I noticed through tracing code that when calling
ProcessContext#outputWithTimestamp, the method checkTimestamp is
invoked[1]. However, no similar check appears to be invoked when calling
FinishBundleContext#output, which explicitly requires passing a timestamp
as one of the
@Matt have you tried any of the "IDE Tasks" available through gradle?
"./gradlew tasks" from beam top-level will list available tasks, and the
IDE Tasks subsection includes tasks specific to trying to bootstrap or
clean up beam project in either Eclipse or Intellij. Ex. "./gradlew idea"
should
to know this works! Thanks Luke
On Tue, Aug 10, 2021 at 1:19 PM Luke Cwik wrote:
>
>
> On Tue, Aug 10, 2021 at 10:11 AM Evan Galpin
> wrote:
>
>> Thanks for your responses Luke. One point I have confusion over:
>>
>> * Modify the sink implementation
-in IO module of
Beam, modification of the Sink may not be immediately feasible. Is the only
recourse in that case to drain a job an start a new one?
On Tue, Aug 10, 2021 at 12:54 PM Luke Cwik wrote:
>
>
> On Tue, Aug 10, 2021 at 8:54 AM Evan Galpin wrote:
>
>> Hi all,
>>
&
Hi all,
I recently had an experience where a streaming pipeline became "clogged"
due to invalid data reaching the final step in my pipeline such that the
data was causing a non-transient error when writing to my Sink. Since the
job is a streaming job, the element (bundle) was continuously
Much appreciated, that option is super helpful. Thanks again for the help!
Evan
On Fri, Jul 30, 2021 at 12:33 PM Robert Bradshaw
wrote:
> You can try using --performance_runtime_type_check which will
> hopefully help you pinpoint the bad type.
>
> On Fri, Jul 30, 2021 at 9:23 AM
roneous type declaration.
>
> On Fri, Jul 30, 2021 at 7:28 AM Evan Galpin wrote:
> >
> > Hi all,
> >
> > I wonder if anyone can shed some light on an issue I'm having with type
> checking and the Beam python SDK.
> >
> > Without changing any python co
Hi all,
I wonder if anyone can shed some light on an issue I'm having with type
checking and the Beam python SDK.
Without changing any python code, I'm finding that running my pipeline with
or without the option no_pipeline_type_check starts the pipeline without
issue; no type check errors are
Hi all,
I just ran into some broken links from
https://beam.apache.org/documentation/io/built-in/. It appears as though
links for IOs are routing to javadocs that don't exist. Ex.
https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/parquet/ParquetIO.html
Maybe javadoc site
Is there any possibility of changing the build cadence allowing for builds
released as alpha versions or similar? It’s not too uncommon for projects
to have nightly builds for example. Could that help deliver fixes more
quickly to customers, while also avoiding the nuisances mentioned in this
Hi there,
You can unsubscribe by sending an empty email to
dev-unsubscr...@beam.apache.org and likewise
user-unsubscr...@beam.apache.org
Thanks,
Evan
On Wed, Jun 2, 2021 at 11:17 Pasan Kamburugamuwa <
pasankamburugamu...@gmail.com> wrote:
> Hi,
> Please unsubscribe me
>
> Thank you
>
+1
FWIW I recently ran into the exact case you described (high serialization
cost). The solution was to implement some not-so-intuitive alternative
transforms in my case, but I would have very much appreciated faster
serialization performance.
Thanks,
Evan
On Tue, May 25, 2021 at 15:26 Stephan
It did, yes :-)
On Mon, May 24, 2021 at 21:17 Reuven Lax wrote:
> Did Java PreCommit pass on your PR?
>
> On Mon, May 24, 2021 at 5:42 PM Evan Galpin wrote:
>
>> I’m not certain that it’s related based on a quick scan of the test
>> output that you linked, but I do k
I’m not certain that it’s related based on a quick scan of the test output
that you linked, but I do know that I recently made a change[1] to
Reshuffe.AssignToShard which @Daniel Collins mentioned was used by PubSub
Lite[2].
Given that the change is recent and the test is failing on remote but
Any further thoughts here? Or tips on profiling Beam DirectRunner?
Thanks,
Evan
On Wed, May 12, 2021 at 6:22 PM Evan Galpin wrote:
> Ok gotcha. In my tests, all sdk versions 2.25.0 and higher exhibit slow
> behaviour regardless of use_deprecated_reads. Not sure if that points to
>
Not sure if it's on the radar but there was mention on this Reshuffle PR[1]
of cherry-picking it to 2.30.0
[1] https://github.com/apache/beam/pull/14720
Thanks,
Evan
On Thu, May 13, 2021 at 6:50 PM Heejong Lee wrote:
> UPDATE:
>
> All precommit and postcommit tests are passed now:
>
d, May 12, 2021 at 5:53 PM Evan Galpin wrote:
>
>> Ah ok thanks for that. Do you mean use_deprecated_reads is broken
>> specifically in 2.29.0 (regression) or broken in all versions up to and
>> including 2.29.0 (ie never worked)?
>>
>> Thanks,
>> Evan
>>
; which will fix your performance issue.
>
> On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang wrote:
>
>> Hi Evan,
>>
>> It seems like the slow step is not the read that use_deprecated_read
>> targets for. Would you like to share your pipeline code if possible?
>>
>&g
step is not the read that use_deprecated_read
> targets for. Would you like to share your pipeline code if possible?
>
> On Wed, May 12, 2021 at 1:35 PM Evan Galpin wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. I
.19.
>
> On Wed, May 12, 2021 at 2:55 PM Evan Galpin wrote:
>
>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
>> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
>> my use case at least) regardless of
Steve Niemitz wrote:
> use_deprecated_read was broken in 2.19 on the direct runner and didn't do
> anything. [1] I don't think the fix is in 2.20 either, but will be in 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote:
>
>&g
I forgot to also mention that in all tests I was setting
--experiments=use_deprecated_read
Thanks,
Evan
On Wed, May 12, 2021 at 1:39 PM Evan Galpin wrote:
> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
> DirectRunner slowness, not just pubsub. I have a pipeline l
:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
processElement
INFO: Got file contents for document_id my-file2.json
Any thoughts on what could be causing this?
Thanks,
Evan
On Wed, May 12, 2021 at 9:53 AM Evan Galpin wrote:
>
>
> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang wrote:
>
7ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>
>> We should rollback using the SDF wrapper by default because of the
>> usability and performance issues reported.
>>
>>
>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin
>> wrote:
>>
>&g
Hi all,
I’m experiencing very slow performance and startup delay when testing a
pipeline locally. I’m reading data from a Google PubSub subscription as the
data source, and before each pipeline execution I ensure that data is
present in the subscription (readable from GCP console).
I’m seeing
n Fri, May 7, 2021 at 1:56 PM Brian Hulette wrote:
>
>> I suspect this was unintentional. It looks like @Daniel Collins
>> added the numBuckets parameter in
>> https://github.com/apache/beam/pull/11919, maybe they can confirm.
>>
>> Brian
>>
>> On Mon,
Thanks!
On Wed, May 5, 2021 at 23:14 Boyuan Zhang wrote:
> Hi,
> Yes, just like normal DoFn, Splittable DoFn preserves the window
> information as well.
>
> On Wed, May 5, 2021 at 8:04 PM Evan Galpin wrote:
>
>> Hi folks,
>>
>> I’d just like to confir
Hi folks,
I’d just like to confirm what happens to window assignments through a
SplittableDoFn. Are output elements automatically assigned to the same
window as input elements?
Thanks,
Evan
Hi all,
While testing for a feature I’m implementing, I noticed that
Reshuffle.AssignToShard[1] produces (N*2)-1 buckets, where N is the value
of the user-defined numBuckets parameter. This is because the value of the
variable having the remainder operator applied, hashOfShard, can be
negative.
the last saved
> checkpoint and restart processing from that point.
>
> On Sat, Apr 24, 2021 at 10:53 AM Evan Galpin
> wrote:
>
>> Thanks Reuven! I assume for other runners the semantics might differ
>> significantly?
>>
>> Do you happen to know if the
ase of Dataflow, storage is backed by a distributed storage
> system, and this storage is separate from the worker node. Crashing worker
> nodes will not cause data loss.
>
> At the present time though, the storage is tied to a single data center.
>
> Reuven
>
> On Sat, A
Hi all!
First off, I apologize for potentially dredging up a topic which has been
asked a number of times before. I’m looking for slightly more/different
info than I have seen before however:
I’ve seen in a number of StackOverflow answers[1][2][3] mention of the
phrase “durably committed” in
;> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch-tests/elasticsearch-tests-7/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOIT.java
>> [2]
>> https://github.com/apache/beam/tree/master/.test-infra/kubernetes/elasticsearch
>> [3]
>&
Hi folks!
I'm Evan, and I'm fairly new to developing the Beam SDK. I've been a user
for a number of years and have done some private SDK customizations along
the way for my day job, but have recently been given the green light to
contribute back to the OSS repo In particular, I've worked with
I ran into this same issue recently as well. It looks like you may have
already found the same, but I can say that adding execute permissions did
fix the issue for me.
If I recall, adding execute perms lead to another very similar issue where
an additional exe required the same fix. Once applied
e commands will work in the deve environment you refer to.
>>
>> As for finding the target names, I don't have the best answer here. The
>> wiki has some commands for running different types of tests
>> "ValidatesRunner", etc. IIRC these require a command with
Hey beam devs! Does anyone have any tips for running Beam unit tests and
pre-commit checks in the dockerized development environment created by
using the shell script in the repo root "start-build-env.sh"?
For PreCommit_Java, I can't seem to find a gradle task which corresponds
(as compared to
80 matches
Mail list logo