Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-20 Thread Evan Galpin
sal: is there something that stops us from using >>> RowCoder by default here? I assume all forms of "PubsubMessage" can be >>> encoded with RowCoder, it provides flexibility for future changes, and >>> PubSub will be able to take advantage of future work to make Row

Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Evan Galpin
Bump  Any other risks or drawbacks associated with altering the default coder for PubsubMessage to be the most inclusive coder with respect to possible fields? Thanks, Evan On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin wrote: > Hi folks, > > I'd like to solicit feedback on t

[Proposal] Change to Default PubsubMessage Coder

2022-12-12 Thread Evan Galpin
Hi folks, I'd like to solicit feedback on the notion of using PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the default coder for Pubsub messages instead of the current default of PubsubMessageWithAttributesCoder. Not long ago, support for reading and writing Pubsub messages

Re: [Question][Dataflow][Java][pubsub] Streaming Pipeline Stall Scenarios

2022-11-17 Thread Evan Galpin
friendly bump  Anyone have thoughts or answers? Thanks! On Thu, Nov 3, 2022 at 3:07 PM Evan Galpin wrote: > Hi folks, > > Hoping to get some definitive answers with respect to streaming pipeline > bundle retry semantics on Dataflow. I understand that a bundle containing >

[Question][Dataflow][Java][pubsub] Streaming Pipeline Stall Scenarios

2022-11-03 Thread Evan Galpin
Hi folks, Hoping to get some definitive answers with respect to streaming pipeline bundle retry semantics on Dataflow. I understand that a bundle containing a "poison pill" (bad data, let's say it causes a null pointer exception when processing in DoFn) will be retried indefinitely. What I'm

Re: [question] Good Course to learn beam

2022-08-30 Thread Evan Galpin
+dev for additional visibility/input On Mon, Aug 29, 2022 at 11:10 AM Leandro Nahabedian via user < u...@beam.apache.org> wrote: > Hi community! > > I'm looking for a good course to learn apache beam and I saw this one >

Re: [ANNOUNCE] New committer: John Casey

2022-07-29 Thread Evan Galpin
Congrats John! On Fri, Jul 29, 2022 at 17:18 Ahmet Altay via dev wrote: > Congratulations John! > > On Fri, Jul 29, 2022 at 2:03 PM Svetak Sundhar via dev < > dev@beam.apache.org> wrote: > >> Congrats John!! >> >> >> Svetak Sundhar >> >> Technical Solutions Engineer, Data >> s

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-23 Thread Evan Galpin
One final note of clarification: the pom file needs to be in the same directory as the jar On Fri, Jul 22, 2022 at 11:01 Evan Galpin wrote: > It's working! Huge thank you to Steve Niemitz who pointed out the need for > "--experiments=enable_custom_pubsub_sink" to prevent d

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-22 Thread Evan Galpin
HOT") } } } 8. Deploy the user code pipeline including the flag: --experiments=enable_custom_pubsub_sink On Thu, Jul 21, 2022 at 4:42 PM Evan Galpin wrote: > Thanks Tomo, I'll check that out too as a good safeguard! Are you > familiar with any process to build pre-release

Re: Extending 2.41.0 Java snapshot TTL

2022-07-21 Thread Evan Galpin
Admittedly this is potentially self-serving, but I feel there could be mutual benefit. I have a similar situation where I want to use pre-release version of beam-sdks-java-io-google-cloud-platform. Though I’ve been having trouble doing so, a possible alternative solution to using the nightly

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-21 Thread Evan Galpin
t; > On Thu, Jul 21, 2022 at 3:35 PM Evan Galpin wrote: > >> Spoke too soon... still can't seem to get the new behaviour to appear in >> dataflow, possibly something is being overridden? >> >> On Thu, Jul 21, 2022 at 3:15 PM Evan Galpin wrote: >> >>> M

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-21 Thread Evan Galpin
Spoke too soon... still can't seem to get the new behaviour to appear in dataflow, possibly something is being overridden? On Thu, Jul 21, 2022 at 3:15 PM Evan Galpin wrote: > Making a shadowJar from "beam-sdks-java-io-google-cloud-platform" looks to > be work

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-21 Thread Evan Galpin
en deploying the job to dataflow. Looks ok. On Thu, Jul 21, 2022 at 3:02 PM Evan Galpin wrote: > I believe I have the dependencySubstitution working, but it seems as > though the substitution is removing transitive deps of > "beam-sdks-java-io-google-cloud-platform", hmm... &

Re: [Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-21 Thread Evan Galpin
I believe I have the dependencySubstitution working, but it seems as though the substitution is removing transitive deps of "beam-sdks-java-io-google-cloud-platform", hmm... On Thu, Jul 21, 2022 at 1:15 PM Evan Galpin wrote: > Hi all, > > I'm trying to test a chang

[Dataflow][Guidance] Replacing beam-sdks-java-io-google-cloud-platform with local jar

2022-07-21 Thread Evan Galpin
Hi all, I'm trying to test a change I've made locally, but by validating it on Dataflow. It works locally, but I want to validate on Dataflow. I've tried a few different attempts at module substitution in the build.gradle config file for the pipeline I'm trying to deploy, but I haven't had any

Re: [ANNOUNCE] New committer: Steven Niemitz

2022-07-20 Thread Evan Galpin
Congrats! Well deserved! On Wed, Jul 20, 2022 at 15:17 Chamikara Jayalath via dev < dev@beam.apache.org> wrote: > Congrats, Steve! > > On Wed, Jul 20, 2022, 9:16 AM Austin Bennett > wrote: > >> Great! >> >> On Wed, Jul 20, 2022 at 10:11 AM Aizhamal Nurmamat kyzy < >> aizha...@apache.org> wrote:

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-08 Thread Evan Galpin
th GCP support. It could be something >>> internal but the coders for ephemeral steps that Dataflow adds are based >>> upon existing coders within the graph. >>> >>> On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin wrote: >>> >>>> +dev@ >>&g

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-08 Thread Evan Galpin
do an update > to get the latest version? > > Feel free to share the job files with GCP support. It could be something > internal but the coders for ephemeral steps that Dataflow adds are based > upon existing coders within the graph. > > On Tue, Jul 5, 2022 at 8:03 AM Evan Galpin

Re: [Dataflow][Java] Guidance on Transform Mapping Streaming Update

2022-07-05 Thread Evan Galpin
t; is not present in the job file, maybe an internal Dataflow thing?), so I'm confident that the coder hasn't actually changed. I'm not sure how to proceed in updating the running pipeline, and I'd really prefer not to drain. Any ideas? Thanks, Evan On Fri, Oct 22, 2021 at 3:36 PM Evan Galpin

Re: Chained Job Graph Apache Beam | Dataflow

2022-06-15 Thread Evan Galpin
It may also be helpful to explore CoGroupByKey as a way of joining data, though depending on the shape of the data doing so may not fit in mem: https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/ - Evan On Wed, Jun 15, 2022 at 3:45 PM Bruno Volpato wrote: > Hello, >

Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-15 Thread Evan Galpin
On Wed, Jun 15, 2022 at 1:46 PM Ahmet Altay wrote: > > > On Wed, Jun 15, 2022 at 10:39 AM Evan Galpin wrote: > >> RE https://github.com/apache/beam/issues/21690, I've just opened a PR >> with a fix: https://github.com/apache/beam/pull/21895. If that feels >> too

Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-15 Thread Evan Galpin
RE https://github.com/apache/beam/issues/21690, I've just opened a PR with a fix: https://github.com/apache/beam/pull/21895. If that feels too rushed, I've also opened a PR to revert the change that introduced the regression in the first place: https://github.com/apache/beam/pull/21884. For

Re: Null PCollection errors in v2.40 unit tests

2022-06-14 Thread Evan Galpin
I had this happen to me recently as well. After `git bisecting` led to confusing results, I ran my tests again via gradlew adding `--rerun-tasks` to the command. This is an expensive operation, but after I ran that I was able to test again with expected results. YMMV Thanks, Evan On Tue, Jun

Re: [ANNOUNCE] New committer: Ke Wu

2022-05-30 Thread Evan Galpin
Congrats Ke! - Evan On Mon, May 30, 2022 at 4:11 AM Jan Lukavský wrote: > Congrats Ke! > > Jan > On 5/29/22 04:12, Yi Pan wrote: > > Congrats, Ke! > > -Yi > > On Sat, May 28, 2022 at 6:57 PM Robert Burke wrote: > >> Congratulations! >> Another place that runs the Go SDK ;) >> >> On Fri, May

Re: Timer bug in 2.37 around output timestamps?

2022-04-01 Thread Evan Galpin
I believe that this thread is entirely related to an another thread[1] where there is discussion that the correct fix for this issue could be to enforce that watermark updates would only happen at bundle boundaries. There’s another related thread[2] citing the same error with ElasticsearchIO.

Re: Updating output watermark on bundle boundaries

2022-03-30 Thread Evan Galpin
Robert Bradshaw wrote: > > On Mon, Mar 28, 2022 at 11:45 AM Jan Lukavský wrote: > >> On 3/28/22 20:17, Reuven Lax wrote: > >> > >> On Mon, Mar 28, 2022 at 11:08 AM Robert Bradshaw > wrote: > >>> On Mon, Mar 28, 2022 at 11:04 AM Reuven Lax wrote: &g

Re: Updating output watermark on bundle boundaries

2022-03-28 Thread Evan Galpin
updating watermarks >> > only at bundle boundaries. This seems perfectly legal for a pure 1:1 >> > mapping DoFn. The issue is that DoFns are allowed to buffer data and >> > emit them in a later process (or finishBundle). If the watermark has >> > moved on, that may result in l

Re: Updating output watermark on bundle boundaries

2022-03-28 Thread Evan Galpin
ly legal for a pure 1:1 > > mapping DoFn. The issue is that DoFns are allowed to buffer data and > > emit them in a later process (or finishBundle). If the watermark has > > moved on, that may result in late data. We don't really have a way for > > a DoFn to declare *it's* output

Re: Remove support for Elasticsearch 5 and 6 ?

2022-03-28 Thread Evan Galpin
kes sense to explicitly deprecate for a while pending removal. > Logging a warning is a good idea. > > On Fri, Mar 25, 2022 at 10:02 AM Evan Galpin > wrote: > > > > I'm +1 on removing Elasticsearch 5.x support which has not been > supported by elastic.co for 3 years

Re: Remove support for Elasticsearch 5 and 6 ?

2022-03-25 Thread Evan Galpin
I'm +1 on removing Elasticsearch 5.x support which has not been supported by elastic.co for 3 years according to the link provided. It looks like support for the most recent version of 6.x was only dropped 1 month ago according to the same data. I'd say given the recency, we should continue to

Re: Updating output watermark on bundle boundaries

2022-03-24 Thread Evan Galpin
Thanks for starting this thread Jan, I'm keen to hear thoughts and outcomes! I thought I would mention that answers to the questions posed here will help to unblock a 2.38.0 release blocker[1]. [1] https://issues.apache.org/jira/browse/BEAM-14064 On Thu, Mar 24, 2022 at 5:28 AM Jan Lukavský

Re: Possible bug in ElasticsearchIO

2022-03-11 Thread Evan Galpin
ents of a singular bundle are processed[2]" > Within the lifetime of a single bundle, you don't need to worry about > holding the watermark back as the runners do that for you. [1] https://lists.apache.org/thread/7m1gx1o7br7jfblhz4yl7czwhhgc7ml3 [2] https://lists.apache.org/thread/fnrlw7rkcg4of5

Re: Possible bug in ElasticsearchIO

2022-03-11 Thread Evan Galpin
t; and the StatefulBulkIOFn would process the iterable as a batch that is > sent to ElasticsearchIO. > > 1: > https://github.com/apache/beam/blob/c76a1c8361b83dd85b6edf0eddd5ebe03873e0ce/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reify.java#L220 > > > &

Re: [ANNOUNCE] New committer: Moritz Mack

2022-03-11 Thread Evan Galpin
Congrats Moritz! On Fri, Mar 11, 2022 at 3:05 AM Etienne Chauchot wrote: > Congrats Moritz ! Well deserved ! > > Etienne > Le 10/03/2022 à 19:44, Sachin Agarwal a écrit : > > Congratulations Moritz! > > On Thu, Mar 10, 2022 at 10:44 AM Alexey Romanenko < > aromanenko@gmail.com> wrote: > >>

Re: [ANNOUNCE] New committer: Evan Galpin

2022-03-10 Thread Evan Galpin
n me and the rest of the Beam PMC in welcoming a new committer: >> Evan Galpin >> >> Since joining the Beam community Evan has done lots of contributions to >> IOs mainly Elasticsearch, but also to SDK transforms. He also gave support >> on the ML and tested releases. >&g

Re: Possible bug in ElasticsearchIO

2022-03-10 Thread Evan Galpin
et around an implementation detail that is >> sub-optimal within Dataflow's GroupIntoBatches implementation. >> >> On Thu, Mar 10, 2022 at 6:26 AM Kenneth Knowles wrote: >> >>> >>> >>> On Wed, Mar 9, 2022 at 10:02 AM Evan Galpin >>> wrote:

Re: Possible bug in ElasticsearchIO

2022-03-09 Thread Evan Galpin
/github.com/apache/beam/blob/a126adbc6aa73f1e30adfa65a3710f7f69a7ba89/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java#L313-L318 On Wed, Mar 9, 2022 at 10:13 AM Evan Galpin wrote: > Oops, unfortunate misfire before being done writing

Re: Possible bug in ElasticsearchIO

2022-03-09 Thread Evan Galpin
g) also has a notion of "getFailedWrites" with the same intent as what's in ElasticsearchIO; I wonder how that's implemented? I'm keen to follow any further discussion on watermark holds for DoFn's that implement both @StartBundle and @FinishBundle! - Evan On Wed, Mar 9, 2022 at 10:05 AM Evan

Re: Possible bug in ElasticsearchIO

2022-03-09 Thread Evan Galpin
rs inline. > On 3/8/22 22:00, Evan Galpin wrote: > > Thanks Jan for confirming that the fix looks alright. I also found a > PR[1] that appears to be a good case study of the Timer watermark hold > technique that you previously mentioned so I'll study that a bit for my own > unde

Re: Possible bug in ElasticsearchIO

2022-03-08 Thread Evan Galpin
ps only to future instant (and inside same window!), which has > little semantic meaning to me. Looks more error prone than useful. > > Jan > On 3/8/22 15:53, Evan Galpin wrote: > > Thanks Jan, it's interesting to read about the handling of timestamp in > cases employing a buffer

Re: Possible bug in ElasticsearchIO

2022-03-08 Thread Evan Galpin
, but might not be appropriate here. > > Currently, the only way to limit the progress of output watermark is by > setting a timer with output timestamp that has the timestamp of the > earliest element in the buffer. There was a thread that was discussing this > in more details [1

Re: Possible bug in ElasticsearchIO

2022-03-07 Thread Evan Galpin
validation code. [1] https://github.com/apache/beam/pull/16744 On Mon, Mar 7, 2022 at 1:39 PM Evan Galpin wrote: > x-post from the associated Jira ticket[0] > > > Fortunately/unfortunately this same issue struck me as well, and I opened > a PR[1] to use `ProcessContext#output`

Re: Possible bug in ElasticsearchIO

2022-03-07 Thread Evan Galpin
x-post from the associated Jira ticket[0] Fortunately/unfortunately this same issue struck me as well, and I opened a PR[1] to use `ProcessContext#output` rather than `ProcessContext#outputWithTimestamp`. I believe that should resolve this issue, it has for me when running jobs with a vendored

Timestamp Verification when Outputting from FinishBundleContext Vs. ProcessContext

2022-02-02 Thread Evan Galpin
Hey folks, I noticed through tracing code that when calling ProcessContext#outputWithTimestamp, the method checkTimestamp is invoked[1]. However, no similar check appears to be invoked when calling FinishBundleContext#output, which explicitly requires passing a timestamp as one of the

Re: IO Connector

2021-10-12 Thread Evan Galpin
@Matt have you tried any of the "IDE Tasks" available through gradle? "./gradlew tasks" from beam top-level will list available tasks, and the IDE Tasks subsection includes tasks specific to trying to bootstrap or clean up beam project in either Eclipse or Intellij. Ex. "./gradlew idea" should

Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
to know this works! Thanks Luke  On Tue, Aug 10, 2021 at 1:19 PM Luke Cwik wrote: > > > On Tue, Aug 10, 2021 at 10:11 AM Evan Galpin > wrote: > >> Thanks for your responses Luke. One point I have confusion over: >> >> * Modify the sink implementation

Re: [Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
-in IO module of Beam, modification of the Sink may not be immediately feasible. Is the only recourse in that case to drain a job an start a new one? On Tue, Aug 10, 2021 at 12:54 PM Luke Cwik wrote: > > > On Tue, Aug 10, 2021 at 8:54 AM Evan Galpin wrote: > >> Hi all, >> &

[Dataflow][Java][2.30.0] Best practice for clearing stuck data in streaming pipeline

2021-08-10 Thread Evan Galpin
Hi all, I recently had an experience where a streaming pipeline became "clogged" due to invalid data reaching the final step in my pipeline such that the data was causing a non-transient error when writing to my Sink. Since the job is a streaming job, the element (bundle) was continuously

Re: Python pipeline type check (--no_pipeline_type_check) alters runtime behaviour

2021-07-30 Thread Evan Galpin
Much appreciated, that option is super helpful. Thanks again for the help! Evan On Fri, Jul 30, 2021 at 12:33 PM Robert Bradshaw wrote: > You can try using --performance_runtime_type_check which will > hopefully help you pinpoint the bad type. > > On Fri, Jul 30, 2021 at 9:23 AM

Re: Python pipeline type check (--no_pipeline_type_check) alters runtime behaviour

2021-07-30 Thread Evan Galpin
roneous type declaration. > > On Fri, Jul 30, 2021 at 7:28 AM Evan Galpin wrote: > > > > Hi all, > > > > I wonder if anyone can shed some light on an issue I'm having with type > checking and the Beam python SDK. > > > > Without changing any python co

Python pipeline type check (--no_pipeline_type_check) alters runtime behaviour

2021-07-30 Thread Evan Galpin
Hi all, I wonder if anyone can shed some light on an issue I'm having with type checking and the Beam python SDK. Without changing any python code, I'm finding that running my pipeline with or without the option no_pipeline_type_check starts the pipeline without issue; no type check errors are

Re: [RESULT][VOTE] Release 2.31.0, release candidate #1

2021-07-15 Thread Evan Galpin
Hi all, I just ran into some broken links from https://beam.apache.org/documentation/io/built-in/. It appears as though links for IOs are routing to javadocs that don't exist. Ex. https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/parquet/ParquetIO.html Maybe javadoc site

Re: Aliasing Pub/Sub Lite IO in external repo

2021-07-01 Thread Evan Galpin
Is there any possibility of changing the build cadence allowing for builds released as alpha versions or similar? It’s not too uncommon for projects to have nightly builds for example. Could that help deliver fixes more quickly to customers, while also avoiding the nuisances mentioned in this

Re: Unsuscribe

2021-06-02 Thread Evan Galpin
Hi there, You can unsubscribe by sending an empty email to dev-unsubscr...@beam.apache.org and likewise user-unsubscr...@beam.apache.org Thanks, Evan On Wed, Jun 2, 2021 at 11:17 Pasan Kamburugamuwa < pasankamburugamu...@gmail.com> wrote: > Hi, > Please unsubscribe me > > Thank you >

Re: Out of band pickling in Python (pickle5)

2021-05-25 Thread Evan Galpin
+1 FWIW I recently ran into the exact case you described (high serialization cost). The solution was to implement some not-so-intuitive alternative transforms in my case, but I would have very much appreciated faster serialization performance. Thanks, Evan On Tue, May 25, 2021 at 15:26 Stephan

Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Evan Galpin
It did, yes :-) On Mon, May 24, 2021 at 21:17 Reuven Lax wrote: > Did Java PreCommit pass on your PR? > > On Mon, May 24, 2021 at 5:42 PM Evan Galpin wrote: > >> I’m not certain that it’s related based on a quick scan of the test >> output that you linked, but I do k

Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Evan Galpin
I’m not certain that it’s related based on a quick scan of the test output that you linked, but I do know that I recently made a change[1] to Reshuffe.AssignToShard which @Daniel Collins mentioned was used by PubSub Lite[2]. Given that the change is recent and the test is failing on remote but

Re: Extremely Slow DirectRunner

2021-05-14 Thread Evan Galpin
Any further thoughts here? Or tips on profiling Beam DirectRunner? Thanks, Evan On Wed, May 12, 2021 at 6:22 PM Evan Galpin wrote: > Ok gotcha. In my tests, all sdk versions 2.25.0 and higher exhibit slow > behaviour regardless of use_deprecated_reads. Not sure if that points to >

Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-05-14 Thread Evan Galpin
Not sure if it's on the radar but there was mention on this Reshuffle PR[1] of cherry-picking it to 2.30.0 [1] https://github.com/apache/beam/pull/14720 Thanks, Evan On Thu, May 13, 2021 at 6:50 PM Heejong Lee wrote: > UPDATE: > > All precommit and postcommit tests are passed now: >

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
d, May 12, 2021 at 5:53 PM Evan Galpin wrote: > >> Ah ok thanks for that. Do you mean use_deprecated_reads is broken >> specifically in 2.29.0 (regression) or broken in all versions up to and >> including 2.29.0 (ie never worked)? >> >> Thanks, >> Evan >>

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
; which will fix your performance issue. > > On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang wrote: > >> Hi Evan, >> >> It seems like the slow step is not the read that use_deprecated_read >> targets for. Would you like to share your pipeline code if possible? >> >&g

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
step is not the read that use_deprecated_read > targets for. Would you like to share your pipeline code if possible? > > On Wed, May 12, 2021 at 1:35 PM Evan Galpin wrote: > >> I just tried with v2.29.0 and use_deprecated_read but unfortunately I >> observed slow behavior again. I

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
.19. > > On Wed, May 12, 2021 at 2:55 PM Evan Galpin wrote: > >> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster" >> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for >> my use case at least) regardless of

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Steve Niemitz wrote: > use_deprecated_read was broken in 2.19 on the direct runner and didn't do > anything. [1] I don't think the fix is in 2.20 either, but will be in 2.21. > > [1] https://github.com/apache/beam/pull/14469 > > On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote: > >&g

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
I forgot to also mention that in all tests I was setting --experiments=use_deprecated_read Thanks, Evan On Wed, May 12, 2021 at 1:39 PM Evan Galpin wrote: > Hmm, I think I spoke too soon. I'm still seeing an issue of overall > DirectRunner slowness, not just pubsub. I have a pipeline l

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement INFO: Got file contents for document_id my-file2.json Any thoughts on what could be causing this? Thanks, Evan On Wed, May 12, 2021 at 9:53 AM Evan Galpin wrote: > > > On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang wrote: >

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
7ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E >> >> We should rollback using the SDF wrapper by default because of the >> usability and performance issues reported. >> >> >> On Sat, May 8, 2021 at 12:57 AM Evan Galpin >> wrote: >> >&g

Extremely Slow DirectRunner

2021-05-07 Thread Evan Galpin
Hi all, I’m experiencing very slow performance and startup delay when testing a pipeline locally. I’m reading data from a Google PubSub subscription as the data source, and before each pipeline execution I ensure that data is present in the subscription (readable from GCP console). I’m seeing

Re: Potential Bug in Reshuffle.AssignToShard?

2021-05-07 Thread Evan Galpin
n Fri, May 7, 2021 at 1:56 PM Brian Hulette wrote: > >> I suspect this was unintentional. It looks like @Daniel Collins >> added the numBuckets parameter in >> https://github.com/apache/beam/pull/11919, maybe they can confirm. >> >> Brian >> >> On Mon,

Re: Window Assignment Across SplittableDoFn

2021-05-06 Thread Evan Galpin
Thanks! On Wed, May 5, 2021 at 23:14 Boyuan Zhang wrote: > Hi, > Yes, just like normal DoFn, Splittable DoFn preserves the window > information as well. > > On Wed, May 5, 2021 at 8:04 PM Evan Galpin wrote: > >> Hi folks, >> >> I’d just like to confir

Window Assignment Across SplittableDoFn

2021-05-05 Thread Evan Galpin
Hi folks, I’d just like to confirm what happens to window assignments through a SplittableDoFn. Are output elements automatically assigned to the same window as input elements? Thanks, Evan

Potential Bug in Reshuffle.AssignToShard?

2021-05-03 Thread Evan Galpin
Hi all, While testing for a feature I’m implementing, I noticed that Reshuffle.AssignToShard[1] produces (N*2)-1 buckets, where N is the value of the user-defined numBuckets parameter. This is because the value of the variable having the remainder operator applied, hashOfShard, can be negative.

Re: How Durable is “durably committed” Data?

2021-04-24 Thread Evan Galpin
the last saved > checkpoint and restart processing from that point. > > On Sat, Apr 24, 2021 at 10:53 AM Evan Galpin > wrote: > >> Thanks Reuven! I assume for other runners the semantics might differ >> significantly? >> >> Do you happen to know if the

Re: How Durable is “durably committed” Data?

2021-04-24 Thread Evan Galpin
ase of Dataflow, storage is backed by a distributed storage > system, and this storage is separate from the worker node. Crashing worker > nodes will not cause data loss. > > At the present time though, the storage is tied to a single data center. > > Reuven > > On Sat, A

How Durable is “durably committed” Data?

2021-04-24 Thread Evan Galpin
Hi all! First off, I apologize for potentially dredging up a topic which has been asked a number of times before. I’m looking for slightly more/different info than I have seen before however: I’ve seen in a number of StackOverflow answers[1][2][3] mention of the phrase “durably committed” in

Re: [QUESTION] Dockerized Integration Tests with Java/Gradle

2021-04-22 Thread Evan Galpin
;> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch-tests/elasticsearch-tests-7/src/test/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIOIT.java >> [2] >> https://github.com/apache/beam/tree/master/.test-infra/kubernetes/elasticsearch >> [3] >&

[QUESTION] Dockerized Integration Tests with Java/Gradle

2021-04-22 Thread Evan Galpin
Hi folks! I'm Evan, and I'm fairly new to developing the Beam SDK. I've been a user for a number of years and have done some private SDK customizations along the way for my day job, but have recently been given the green light to contribute back to the OSS repo  In particular, I've worked with

Re: protoc issues in docker container

2021-04-09 Thread Evan Galpin
I ran into this same issue recently as well. It looks like you may have already found the same, but I can say that adding execute permissions did fix the issue for me. If I recall, adding execute perms lead to another very similar issue where an additional exe required the same fix. Once applied

Re: Beam unit tests and Pre-commit

2021-04-05 Thread Evan Galpin
e commands will work in the deve environment you refer to. >> >> As for finding the target names, I don't have the best answer here. The >> wiki has some commands for running different types of tests >> "ValidatesRunner", etc. IIRC these require a command with

Beam unit tests and Pre-commit

2021-04-05 Thread Evan Galpin
Hey beam devs! Does anyone have any tips for running Beam unit tests and pre-commit checks in the dockerized development environment created by using the shell script in the repo root "start-build-env.sh"? For PreCommit_Java, I can't seem to find a gradle task which corresponds (as compared to