Re: Contribution of Asgarde: Error Handling for Beam?

2023-09-08 Thread Daniel Collins via dev
> > Beam is complicated by itself, and we do encourage multiple ways of > solving problems, but that says to me that having an out of repo ecosystem > is the right path, rather than incorporation. > > On Fri, Sep 8, 2023, 8:14 AM Daniel Collins via dev > wrote: > >> I think

Re: Contribution of Asgarde: Error Handling for Beam?

2023-09-08 Thread Daniel Collins via dev
I think there are a lot of interesting and relatively isolated components of the project, it might make sense to write per-transform one pagers for isolated things like the most useful pieces (just basically copying the documentation and justifying the API) instead of doing a one-shot import or

Re: [Proposal] Kaskada DSL and FnHarness for Temporal Queries

2023-06-12 Thread Daniel Collins via dev
How does this mechanism differ from beam SQL which already offers windowing via SQL over PCollections? https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/ -Daniel On Mon, Jun 12, 2023 at 3:11 PM Ryan Michael wrote: > Hello, Beam (also)! > > Just introducing

Re: KafkaIO does not make use of Kafka Consumer Groups [kafka] [java] [io]

2023-04-15 Thread Daniel Collins via dev
Why do you want to use a consumer group? If you have consumers in other jobs, your beam job will fail to receive all messages it should for the topic. > It seems the code attempts to circumvent the partition assignment mechanism provided by Kafka to use it's own. All beam I/Os for partitioned

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Daniel Collins via dev
It sounds like what you're doing here might be best done outside the beam model. Instead of performing the initial computation reading from BQ into a PCollection, perform it using the BigQuery client library in the same manner as you currently do to load the data from redis. On Mon, Feb 27, 2023

Re: SparkRunner - ensure SDF output does not need to fit in memory

2023-01-02 Thread Daniel Collins via dev
0/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/SimpleParDoFn.java#L285 > [2] https://github.com/apache/beam/pull/14755 > [3] > https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17332005=com.atlassian.jira.plugin.system

Re: SparkRunner - ensure SDF output does not need to fit in memory

2022-12-28 Thread Daniel Collins via dev
I believe that for dataflow runner, the result of processElement must also fit in memory, so this is not just a constraint for the spark runner. The best approach at present might be to convert the source from a flatMap to an SDF that reads out chunks of the file at a time, and supports runner

Re: Gradle Task Configuration Avoidance

2022-12-08 Thread Daniel Collins via dev
We could probably add a lint that rejects the spelling `task("` pretty easily that would catch most of these. On Thu, Dec 8, 2022 at 11:34 AM Luke Cwik via dev wrote: > I have found the Gradle build reports very useful to enumerate > deprecations and an easier thing to look at over the command

Re: Avro Schema Evolution in Beam

2022-10-14 Thread Daniel Collins via dev
Hello, Unfortunately the avro binary protocol is incompatible with direct schema evolution as you have attempted here. Instead, to decode an avro message, you must always (per the spec ) have access to the schema used to

Force usage of runner_v2 on dataflow?

2021-10-14 Thread Daniel Collins
Hi all, I'd like to make use of SDF bundle finalizers which require runner_v2 on dataflow. Is there any way to fail a pipeline quickly if not running on runner_v2 or (better) test if the runner supports bundle finalizers? -Daniel

Aliasing Pub/Sub Lite IO in external repo

2021-06-17 Thread Daniel Collins
Hello beam developers, I'm the primary author of the Pub/Sub Lite I/O, and I'd like to get some feedback on a change to the model for hosting this I/O in beam. Our team has been frustrated by the fact that we have no way to release features or fixes for bugs to customers on time scales shorter

Re: sdk.io.gcp.pubsublite.SubscriptionPartitionLoaderTest failing

2021-05-24 Thread Daniel Collins
d on a quick scan of the test >>>> output that you linked, but I do know that I recently made a change[1] to >>>> Reshuffe.AssignToShard which @Daniel Collins mentioned was used by PubSub >>>> Lite[2]. >>>> >>>> Given that the change is recent and

Re: Potential Bug in Reshuffle.AssignToShard?

2021-05-07 Thread Daniel Collins
ing said, withNumBuckets is only used with Pub/Sub Lite, and is safe to make correct. -Daniel On Fri, May 7, 2021 at 1:56 PM Brian Hulette wrote: > I suspect this was unintentional. It looks like @Daniel Collins > added the numBuckets parameter in > https://github.com/apache/beam/pull/

Re: Error comparing flat schema against schema inferred from protoclass

2021-04-21 Thread Daniel Collins
ING NOT NULL, >>> options={{beam:option:proto:meta:number=Option{type=INT32 NOT NULL, >>> value=1 >>> Field{name=height, description=, type=INT32 NOT NULL, >>> options={{beam:option:proto:meta:number=Option{type=INT32 NOT NULL, >>> value=2 >>> Field{name=kno

Re: Error comparing flat schema against schema inferred from protoclass

2021-04-20 Thread Daniel Collins
y idea what else might be going on? > > On Tue, Apr 20, 2021 at 11:13 AM Daniel Collins > wrote: > >> Thanks for working on this! It looks to me like the schemas don't match: >> you appear to be using NameHeightMessage defined as: >> >&

Re: Error comparing flat schema against schema inferred from protoclass

2021-04-20 Thread Daniel Collins
Thanks for working on this! It looks to me like the schemas don't match: you appear to be using NameHeightMessage defined as: ``` message NameHeightMessage { string name = 1; int32 height = 2; } ``` And expecting it to work with a table schema that has a "BOOL knowsJavascript" field. Did you

Re: [Question] Amazon Neptune I/O connector

2021-04-15 Thread Daniel Collins
Hi Gabriel, Write-side adapters for systems tend to be easier than read-side adapters to implement. That being said, looking at the documentation for neptune, it looks to me like there's no direct data load API, only a batch data load from a file on S3

Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Daniel Collins
proto support effort. > If it's ok with you, I'd like to implement and add the protobuf tests to > the PubSubTableProviderIT. > > Please let me know if that works for you. > > Thanks again, > - Fernando Morales > > On Thu, Mar 18, 2021 at 12:15 PM Daniel Collins > wrote: >

Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Daniel Collins
Also see https://github.com/apache/beam/pull/13838 which is related On Thu, Mar 18, 2021 at 3:06 PM Daniel Collins wrote: > > appears to be dead code > > Its not- its used as the format for the configuration row here > https://github.com/apache/beam/blob/3fc2ab10d9f5d5c5b65ecf94c

Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Daniel Collins
> > On Thu, Mar 18, 2021 at 11:46 AM Daniel Collins > wrote: > >> > because there are a couple places where Avro, JSON are still hard-coded >> [2,3] >> >> This is not a blocker, its due to the fact that PubsubTableProvider is >> just a wrapper for PubsubS

Re: BEAM-10884: Add Proto support to Pubsub table provider

2021-03-18 Thread Daniel Collins
18, 2021 at 2:44 PM Brian Hulette wrote: > Hi Fernando, > > Daniel Collins actually added the PayloadSerializerProvider concept very > recently [1], which is why it looks like Piotr's code doesn't apply > anymore. But the good news is I think that PR gets this task pretty clo

Re: Java Tests are failing on Github checks

2021-03-08 Thread Daniel Collins
Hello all, Can we make progress on either updating the version or rolling back https://github.com/apache/beam/commit/dfeda2ab1dbebd8446766fae0cafb314ec29920f? This is blocking any and all PRs from being submitted. Thanks, Daniel On Thu, Mar 4, 2021 at 6:52 PM Shehzaad Nakhoda wrote: > Hi

Re: BEAM-6855

2021-02-18 Thread Daniel Collins
The getState function should be static, sorry. "synchronized static @NotNull MyState getState()" On Thu, Feb 18, 2021 at 3:41 PM Daniel Collins wrote: > > On every dataflow start, I want to read from CloudSQL and build the cache > > If you do this outside of dataflow, you

Re: BEAM-6855

2021-02-18 Thread Daniel Collins
> On every dataflow start, I want to read from CloudSQL and build the cache If you do this outside of dataflow, you can use a static to do this on every worker start. Is that what you're looking for? For example: final class StateLoader { private StateLoader() {} @GuardedBy("this")

Re: Not running tests that need DirectRunner in core?

2021-02-05 Thread Daniel Collins
Ah looks like its a bit harder than this. But I'll try to make it work for building all of beam locally then get back to you with a working PR. On Fri, Feb 5, 2021 at 1:07 PM Daniel Collins wrote: > https://github.com/apache/beam/pull/13903 > > On Fri, Feb 5, 2021 at 12:48 PM Robert

Re: Not running tests that need DirectRunner in core?

2021-02-05 Thread Daniel Collins
lf (which was cyclic). I would be super happy if > this got fixed; having to jump through hoops to run the majority of core > tests is a constant source of surprise. > > On Fri, Feb 5, 2021 at 9:34 AM Daniel Collins > wrote: > >> Hello all, >> >> It looks like no

Not running tests that need DirectRunner in core?

2021-02-05 Thread Daniel Collins
Hello all, It looks like no tests in sdks/java/core will run if they need a runner (They'll use CrashingRunner instead), and if they are annotated with @Category(NeedsRunner.class), they will be excluded from the test rule. Does anyone know the reason for this? It appears rather old (> 3 years),

Re: Use case for reading to dynamic Pub/Sub subscriptions?

2020-12-08 Thread Daniel Collins
/s/Combine/Flatten On Tue, Dec 8, 2020 at 8:06 PM Daniel Collins wrote: > Hi all, > > I'm trying to figure out if there's any possible use for reading from a > dynamic set of Pub/Sub [Lite] subscriptions in a beam pipeline, although > the same logic would apply to kafka topics. D

Use case for reading to dynamic Pub/Sub subscriptions?

2020-12-08 Thread Daniel Collins
Hi all, I'm trying to figure out if there's any possible use for reading from a dynamic set of Pub/Sub [Lite] subscriptions in a beam pipeline, although the same logic would apply to kafka topics. Does anyone know of a use case where you'd want to apply the same set of processing logic to all

Re: Any interest in sharding targets?

2020-12-01 Thread Daniel Collins
t; the module you are editing, that's worth discussing and probably fixing. > > That's going to be a separate issue from wanting to build a single part of > the GCP IO package without building the rest of the package. Details and > questions below. > > On Mon, Nov 30, 2020 at 4:3

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
for performing the work that the primary restriction * represents. On Sat, Nov 28, 2020 at 2:45 AM Daniel Collins wrote: > This does not appear to work: The CheckDone call, as far as I can tell, is > made on the existing range not the split range based on the following error: > > Error

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
work in [4601979, 9223372036854775807) was not attempted org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:862) On Sat, Nov 28, 2020 at 2:18 AM Daniel Collins wrote: > Can you confirm that the following implementation of trySplit will w

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
would nice to update the documentation if that's confusing. > > On Fri, Nov 27, 2020 at 11:05 PM Daniel Collins > wrote: > >> I think the documentation for trySplit() doesn't make it clear that it >> supports this use case. In particular this section: >> >

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
gt;> tryClaim() to yield control back >>> to the runtime? >> >> >> You can do so by implementing RestrictionTracker.trySplit() and using >> resume(). >> >> You may also want to take a look at Kafka example[1]. Hope that is >> helpful. >> &g

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
ame offset when you call currentRestriction().getFrom(). > > > [1] > https://beam.apache.org/documentation/programming-guide/#splittable-dofns > [2] > https://beam.apache.org/documentation/programming-guide/#user-initiated-checkpoint > > On Fri, Nov 27, 2020 at 10:07 PM Dan

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
ing GrowableOffsetRangeTracker[1] to track an > infinite restriction. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/GrowableOffsetRangeTracker.java > > On Fri, Nov 27, 2020 at 9:50 PM Daniel Collins >

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
the sdf and restriction tracker implementations) -Dan On Sat, Nov 28, 2020 at 12:42 AM Boyuan Zhang wrote: > > > On Fri, Nov 27, 2020 at 9:34 PM Daniel Collins > wrote: > >> Hello Boyuan, >> >> Responses inline. >> >> > The checkDone is invoked b

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
ableDoFn initiated > checkpoint to defer processing restriction [3188439, 9223372036854775807), > or you should return stop() only when you have tryClaim() return False. > > > On Fri, Nov 27, 2020 at 9:07 PM Daniel Collins > wrote: > >> Hello all, >> >> I'm trying to

CheckDone in unbounded SDF?

2020-11-27 Thread Daniel Collins
Hello all, I'm trying to convert PubSubLiteIO into an SDF. I'm running into the following error on dataflow with a RestrictionTracker returning UNBOUNDED to isBounded. It looks like calls are being made to `checkDone`, but looking at the documentation of `checkDone`, I don't think there's any

Re: Individual Parallelism support for Flink Runner

2020-06-27 Thread Daniel Collins
I could imagine for example, a 'parallelismHint' field in the base parameters that could be set to maxNumWorkers when running on dataflow or an equivalent parameter when running on flink. It would be useful to get a default value for the sharding in the Reshuffle changes here

Re: Problems with underlying client library types in beam?

2020-06-27 Thread Daniel Collins
cy management but both of these efforts are still relatively > new. > > On Thu, Jun 25, 2020 at 9:28 AM Daniel Collins > wrote: > >> Hello all, >> >> In https://github.com/apache/beam/pull/11919, there is a concern about >> the fact that the underlying clie

Problems with underlying client library types in beam?

2020-06-25 Thread Daniel Collins
Hello all, In https://github.com/apache/beam/pull/11919, there is a concern about the fact that the underlying client library's Message class uses guava's ListMultimap to represent a String -> Collection mapping, and this type is exposed by the interface. A couple of questions: 1) I'd like to

Unable to get java presubmit to run (not pass)

2020-02-11 Thread Daniel Collins
Hello beam developers, I've been trying to rerun presubmits on https://github.com/apache/beam/pull/10476 quite a few times, but it keeps stalling out at the "Java ("Run Java PreCommit") Pending — Build triggered for merge commit." Is there currently a problem with the jenkins cluster? Thanks!

Retest this please access?

2020-02-10 Thread Daniel Collins
Hello all, I'm feeling a bit bad about asking my reviewers to re-run presubmits. How would I go about getting access to "Retest this please" being interpreted as a presubmit trigger on github? Thanks, Daniel

Hello Beam Developers!

2019-12-09 Thread Daniel Collins
My name is Daniel, I'm a developer on the Cloud Pub/Sub Team at Google in New York. I'm looking to make some contributions to the Cloud Pub/Sub integration with beam. I was hoping that I could be added to the JIRA as a contributor. Looking forward to working with everyone!