Re: Any recomendation for key for GroupIntoBatches

2024-04-15 Thread Robert Bradshaw via user
On Fri, Apr 12, 2024 at 1:39 PM Ruben Vargas wrote: > On Fri, Apr 12, 2024 at 2:17 PM Jaehyeon Kim wrote: > > > > Here is an example from a book that I'm reading now and it may be > applicable. > > > > JAVA - (id.hashCode() & Integer.MAX_VALUE) % 100 > > PYTHON - ord(id[0]) % 100 > or
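
The quoted Python expression can be sketched as a small helper. This is only an illustration: `first_char_bucket` mirrors the `ord(id[0]) % 100` example from the message, while `crc32_bucket` is an assumed alternative that is not from the thread. It hashes the whole id with a process-stable checksum, since Python's built-in `hash()` for strings is randomized per process by default and so would not agree across workers.

```python
import zlib

NUM_BUCKETS = 100  # matches the "% 100" in the quoted example


def first_char_bucket(record_id: str) -> int:
    """The quoted Python example: bucket by the code point of the first character."""
    return ord(record_id[0]) % NUM_BUCKETS


def crc32_bucket(record_id: str) -> int:
    """Assumed alternative (not from the thread): stable hash over the whole id."""
    return zlib.crc32(record_id.encode("utf-8")) % NUM_BUCKETS


# Hypothetical ids, for illustration only.
ids = ["user-1", "user-2", "order-7"]
keyed = [(crc32_bucket(record_id), record_id) for record_id in ids]
```

Note that ids sharing a first character, such as "user-1" and "user-2", always land in the same bucket with the first-character variant, which may or may not be the spread you want.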

Re: Hot update in dataflow without lossing messages

2024-04-15 Thread Robert Bradshaw via user
Are you draining[1] your pipeline or simply canceling it and starting a new one? Draining should close open windows and attempt to flush all in-flight data before shutting down. For PubSub you may also need to read from subscriptions rather than topics to ensure messages are processed by either

Re: Fails to run two multi-language pipelines locally?

2024-03-08 Thread Robert Bradshaw via user
an error. My speculation is the containers don't recognise each >> other and get killed by the Flink task manager. I see containers are kept >> created and killed. >> >> Does every multi-language pipeline run in a separate container? >> >> On Thu, 7 Mar 2024, 12

Re: [Question] Python Streaming Pipeline Support

2024-03-08 Thread Robert Bradshaw via user
The Python Local Runner has limited support for streaming pipelines. For the time being I would recommend using Dataflow or Flink (the latter can be run locally) to try out streaming pipelines. On Fri, Mar 8, 2024 at 2:11 PM Puertos tavares, Jose J (Canada) via user wrote: > > Hello Hu: > > > >

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
flink runner. A flink cluster > is started locally. > > On Thu, 7 Mar 2024 at 12:13, Robert Bradshaw via user > wrote: >> >> Streaming portable pipelines are not yet supported on the Python local >> runner. >> >> On Wed, Mar 6, 2024 at 5:03 PM Jaehy

Re: Fails to run two multi-language pipelines locally?

2024-03-06 Thread Robert Bradshaw via user
Streaming portable pipelines are not yet supported on the Python local runner. On Wed, Mar 6, 2024 at 5:03 PM Jaehyeon Kim wrote: > > Hello, > > I use the python SDK and my pipeline reads messages from Kafka and transforms > via SQL. I see two containers are created but it seems that they don't

Re: Roadmap of Calcite support on Beam SQL?

2024-03-04 Thread Robert Bradshaw via user
There is no longer a huge amount of active development going on here, but implementing a missing function seems like an easy contribution (lots of examples to follow). Otherwise, definitely worth filing a feature request as a useful signal for prioritization. On Mon, Mar 4, 2024 at 4:33 PM

Re: ParDo(DoFn) with multiple context.output vs FlatMapElements

2024-01-26 Thread Robert Bradshaw via user
There is no difference; FlatMapElements is implemented in terms of a DoFn that invokes context.output multiple times. And, yes, Dataflow will fuse consecutive operations automatically. So if you have something like ... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ... Dataflow will fuse DoFnA and DoFnB

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
On Wed, Jan 24, 2024 at 10:48 AM Mark Striebeck wrote: > > If point beam to the local jar, will beam start and also stop the expansion > service? Yes it will. > Thanks > Mark > > On Wed, 24 Jan 2024 at 08:30, Robert Bradshaw via user > wrote: >> >

Re: Downloading and executing addition jar file when using Python API

2024-01-24 Thread Robert Bradshaw via user
You can also manually designate a replacement jar to be used rather than fetching the jar from maven, either as a pipeline option or (as of the next release) as an environment variable. The format is a json mapping from gradle targets (which is how we identify these jars) to local files (or urls).
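
A minimal sketch of what such a JSON mapping could look like, built in Python. The gradle target and the jar path below are hypothetical placeholders, not values from the thread, and the name of the pipeline option or environment variable that receives the string is deliberately left out because the snippet does not state it.

```python
import json

# Hypothetical values for illustration only: neither the gradle target
# nor the local jar path is taken from the thread.
service_overrides = {
    "sdks:java:io:expansion-service:shadowJar": "/path/to/local/expansion-service.jar",
}

# Serialize the mapping to the JSON string form described in the message.
overrides_json = json.dumps(service_overrides)
print(overrides_json)
```

Building the string with `json.dumps` rather than by hand avoids quoting mistakes when it is later passed on a command line.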

Re: TypeError: '_ConcatSequence' object is not subscriptable

2024-01-22 Thread Robert Bradshaw via user
This is probably because you're trying to index into the result of the GroupByKey in your AnalyzeSession as if it were a list. All that is promised is that it is an iterable. If it is large enough to merit splitting over multiple fetches, it won't be a list. (If you need to index, explicitly
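
The point about iterables can be shown without Beam. In this sketch a generator stands in for the grouped values (an assumption made for illustration; the real object is whatever the runner hands back), and the helper materializes it into a list before indexing.

```python
from typing import Iterable, Iterator, Tuple


def grouped_values() -> Iterator[int]:
    """Stand-in for the values of a grouped result: iterable, not indexable."""
    yield from (3, 1, 2)


def first_and_last(values: Iterable[int]) -> Tuple[int, int]:
    # Materialize explicitly before indexing; only iteration is promised.
    items = list(values)
    return items[0], items[-1]


result = first_and_last(grouped_values())
```

Materializing holds every value for the key in memory at once, so it is only a reasonable choice when the per-key group is known to be small.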

Re: Does withkeys transform enforce a reshuffle?

2024-01-19 Thread Robert Bradshaw via user
Reshuffle is perfectly fine to use if the goal is just to redistribute work. It's only deprecated as a "checkpointing" mechanism. On Fri, Jan 19, 2024 at 9:44 AM Danny McCormick via user wrote: > > For runners that support Reshuffle, it should be safe to use. It's been > "deprecated" for 7

Re: How to debug ArtifactStagingService ?

2024-01-05 Thread Robert Bradshaw via user
Nothing problematic is standing out for me in those logs. A job service and artifact staging service is spun up to allow the job (and its artifacts) to be submitted, then they are shut down. What are the actual errors that you are seeing? On Wed, Jan 3, 2024 at 7:39 AM Lydian wrote: > > > Hi, >

Re: Dataflow not able to find a module specified using extra_package

2023-12-19 Thread Robert Bradshaw via user
And should it be a list of strings, rather than a string? On Tue, Dec 19, 2023 at 10:10 AM Anand Inguva via user wrote: > Can you try passing `extra_packages` instead of `extra_package` when > passing pipeline options as a dict? > > On Tue, Dec 19, 2023 at 12:26 PM Sumit Desai via user < >
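
Why the distinction between a string and a list of strings matters can be seen in plain Python: code that iterates over the option value sees single characters when handed a bare string. The package path is a hypothetical placeholder, and the options dict only illustrates the shape suggested in the thread; how Beam itself treats a bare string here is not stated in the snippet.

```python
# Hypothetical package path, for illustration only.
package = "./dist/my_module-0.1.tar.gz"

as_string = package   # iterating yields single characters
as_list = [package]   # iterating yields the one path

# Shape suggested in the thread: the plural key with a list value.
options = {"extra_packages": as_list}

print(list(as_string)[:3])
print(list(as_list))
```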

Re: Streaming management exception in the sink target.

2023-12-05 Thread Robert Bradshaw via user
Currently error handling is implemented on sinks on an ad-hoc basis (if at all) but John (cc'd) is looking at improving things here. On Mon, Dec 4, 2023 at 10:25 AM Juan Romero wrote: > > Hi guys. I want to ask you about how to deal with the scenario when the > target sink (eg: jdbc, kafka,

Re: [QUESTION] Why no auto labels?

2023-10-20 Thread Robert Bradshaw via user
On Fri, Oct 13, 2023 at 1:32 PM Joey Tran wrote: >> >> >> >> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw wrote: >>> >>> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran >>> wrote: >>>> >>>> Are there places on the SDK s

Re: Advanced Composite Transform Documentation

2023-10-19 Thread Robert Bradshaw via user
On Thu, Oct 19, 2023 at 2:00 PM Joey Tran wrote: > > For the python SDK, is there somewhere where we document more "advanced" > composite transform operations? I'm not sure, but https://beam.apache.org/documentation/programming-guide/ is the canonical place information like this should probably

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
d though. Another option is to make the suffix a uuid rather than a single counter. (This would still have issues with the first application possibly getting mixed up with a "different" first application unless it was always appended.) > On Fri, Oct 13, 2023 at 12:52 PM Robert Brad

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Robert Bradshaw via user
ssages). At least with the old, intersecting names we can detect this problem rather than silently give corrupt data. On Fri, Oct 13, 2023 at 7:15 AM Joey Tran wrote: > For posterity: https://github.com/apache/beam/pull/28984 > > On Tue, Oct 10, 2023 at 7:29 PM Robert Bradshaw > wrote:

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Robert Bradshaw via user
. Playing with the examples, I wasn't positive > if my runs were actually succeeding or not based on the stdout alone. > > [1] https://play.beam.apache.org/?sdk=java=mI7WUeje_r2 > [2] https://play.beam.apache.org/?sdk=python=hIr

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Robert Bradshaw via user
BeamJava and BeamPython have the exact same behavior: transform names within must be distinct [1]. This is because we do not necessarily know at pipeline construction time if the pipeline will be streaming or batch, or if it will be updated in the future, so the decision was made to impose this

Re: UDF/UADF over complex structures

2023-09-28 Thread Robert Bradshaw via user
Yes, for sure. This is one of the areas Beam excels vs. more simple tools like SQL. You can write arbitrary code to iterate over arbitrary structures in the typical Java/Python/Go/Typescript/Scala/[pick your language] way. In the Beam nomenclature, UDFs correspond to DoFns and UDAFs correspond to

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
> > Is that assumption correct? > > > > El El vie, 15 de septiembre de 2023 a la(s) 10:59, Robert Bradshaw via > user escribió: > >> Beam will block on side inputs until at least one value is available (or >> the watermark has advanced such that we can be s

Re: [Question] Side Input pattern

2023-09-15 Thread Robert Bradshaw via user
Beam will block on side inputs until at least one value is available (or the watermark has advanced such that we can be sure one will never become available, which doesn't really apply to the global window case). After that, workers generally cache the side input value (for performance reasons)

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
ct it's >> dofn, wrap it, and return a new ParDo >> >> On Fri, Sep 15, 2023, 11:53 AM Robert Bradshaw via user < >> user@beam.apache.org> wrote: >> >>> +1 to looking at composite transforms. You could even have a composite >>> transform tha

Re: "Decorator" pattern for PTramsforms

2023-09-15 Thread Robert Bradshaw via user
+1 to looking at composite transforms. You could even have a composite transform that takes another transform as one of its construction arguments and whose expand method does pre- and post-processing to the inputs/outputs before/after applying the transform in question. (You could even implement

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
(As an aside, I think all of these options would make for a great blog post if anyone is interested in authoring one of those...) On Fri, Sep 1, 2023 at 9:26 AM Robert Bradshaw wrote: > You can also use Python's RenderRunner, e.g. > > python -m apache_beam.examples.wordcount --outpu

Re: Options for visualizing the pipeline DAG

2023-09-01 Thread Robert Bradshaw via user
You can also use Python's RenderRunner, e.g. python -m apache_beam.examples.wordcount --output out.txt \ --runner=apache_beam.runners.render.RenderRunner \ --render_output=pipeline.svg This also has an interactive mode, triggered by passing --port=N (where 0 can be used to pick an

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Robert Bradshaw via user
to make and maybe even less typing for the user. I was originally thinking > side inputs and metrics would happen outside the loop, but I think you want > a class and not a closure at that point for sanity. > > On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw > wrote: > >

Re: [Request for Feedback] Swift SDK Prototype

2023-08-23 Thread Robert Bradshaw via user
Neat. Nothing like writing an SDK to actually understand how the FnAPI works :). I like the use of groupBy. I have to admit I'm a bit mystified by the syntax for parDo (I don't know Swift at all which is probably tripping me up). The addition of external (cross-language) transforms could let you

Re: Getting Started With Implementing a Runner

2023-07-24 Thread Robert Bradshaw via user
be of interest. On Fri, Jul 21, 2023 at 7:25 AM Joey Tran wrote: > > Could you let me know when you update it? I would be interested in rereading > after the rewrite. > > Thanks! > Joey > > On Fri, Jul 14, 2023 at 4:38 PM Robert Bradshaw wrote: >> >> I'm takin

Re: Growing checkpoint size with Python SDF for reading from Redis streams

2023-07-20 Thread Robert Bradshaw via user
Your SDF looks fine. I wonder if there is an issue with how Flink is implementing SDFs (e.g. not garbage collecting previous remainders). On Tue, Jul 18, 2023 at 5:43 PM Nimalan Mahendran wrote: > > Hello, > > I am running a pipeline built in the Python SDK that reads from a Redis > stream via

Re: Getting Started With Implementing a Runner

2023-07-14 Thread Robert Bradshaw via user
nner-api-protos> docs > page which implied to me that they'd be safe to use. I'll check out the > bundle_processor. Thanks! > > On Mon, Jul 10, 2023 at 1:07 PM Robert Bradshaw > wrote: > >> On Sun, Jul 9, 2023 at 9:22 AM Joey Tran >> wrote: >> >>

Re: Pandas 2 Timeline Estimate

2023-07-12 Thread Robert Bradshaw via user
Contributions welcome! I don't think we're at the point we can stop supporting Pandas 1.x though, so we'd have to do it in such a way as to support both. On Wed, Jul 12, 2023 at 4:53 PM XQ Hu via user wrote: > https://github.com/apache/beam/issues/27221#issuecomment-1603626880 > > This tracks

Re: Getting Started With Implementing a Runner

2023-07-10 Thread Robert Bradshaw via user
s you more flexibility when it >> comes to choosing an SDK to define the pipeline and will allow you to >> execute transforms in any SDK via cross-language. >> >> Thanks, >> Cham >> >> On Fri, Jun 23, 2023 at 1:57 PM Robert Bradshaw via user < >> us

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
(which in turn invokes the actual DoFns). This latter may be inlined (e.g. if it's 100% Python on both sides). See, for example, https://github.com/apache/beam/blob/v2.48.0/sdks/python/apache_beam/runners/portability/fn_api_runner/worker_handlers.py#L350 > On Fri, Jun 23, 2023 at 4:02 PM

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
h more straightforward than all the bundler scheduler stuff that currently exists in that code). > > > > > > On Fri, Jun 23, 2023 at 12:17 PM Alexey Romanenko < > aromanenko@gmail.com> wrote: > >> >> >> On 23 Jun 2023, at 17:40, Robert Bradshaw via use

Re: Getting Started With Implementing a Runner

2023-06-23 Thread Robert Bradshaw via user
On Fri, Jun 23, 2023, 7:37 AM Alexey Romanenko wrote: > If Beam Runner Authoring Guide is rather high-level for you, then, at > first, I’d suggest to answer two questions for yourself: > - Am I going to implement a portable runner or native one? > The answer to this should be portable, as

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
st.github.com/egalpin/162a04b896dc7be1d0899acf17e676b3 > > On Thu, May 25, 2023 at 2:25 PM Robert Bradshaw via user < > user@beam.apache.org> wrote: > >> The GbkBeforeStatefulParDo is an implementation detail used to send all >> elements with the same key to the same worker (so that t

Re: [Dataflow][Stateful] Bypass Dataflow Overrides?

2023-05-25 Thread Robert Bradshaw via user
The GbkBeforeStatefulParDo is an implementation detail used to send all elements with the same key to the same worker (so that they can share state, which is itself partitioned by worker). This does cause a global barrier in batch pipelines. On Thu, May 25, 2023 at 2:15 PM Evan Galpin wrote: >

Re: [EXTERNAL] Re: Vulnerabilities in Transitive dependencies

2023-05-02 Thread Robert Bradshaw via user
Generally these types of vulnerabilities are only exploitable when processing untrusted data and/or exposing a public service to the internet. This is not the typical use of Beam (especially the latter), but that's not to say Beam can't be used in this way. That being said, it's preferable to

Re: How Beam Pipeline Handle late events

2023-04-24 Thread Robert Bradshaw via user
On Fri, Apr 21, 2023 at 3:37 AM Pavel Solomin wrote: > > Thank you for the information. > > I'm assuming you had a unique ID in records, and you observed some IDs > missing in Beam output comparing with Spark, and not just some duplicates > produced by Spark. > > If so, I would suggest to

Re: [Question] - Time series - cumulative sum in right order with python api in a batch process

2023-04-24 Thread Robert Bradshaw via user
You are correct in that the data may arrive in an unordered way. However, once a window finishes, you are guaranteed to have seen all the data up to that point (modulo late data) and can then confidently compute your ordered cumulative sum. You could do something like this: def
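
The code in the message is cut off in this listing, so the following is a separate sketch of the idea rather than the author's function. It assumes each element is a `(timestamp, value)` pair and that all elements for the window are already in hand, then sorts by timestamp and attaches a running total. The function name and the pair layout are assumptions made for illustration.

```python
from itertools import accumulate
from typing import Iterable, List, Tuple


def ordered_cumulative_sum(
    events: Iterable[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Sort (timestamp, value) pairs by timestamp and attach a running total."""
    ordered = sorted(events, key=lambda event: event[0])
    totals = accumulate(value for _, value in ordered)
    return [(timestamp, total) for (timestamp, _), total in zip(ordered, totals)]


# Elements arrive out of order; the result is ordered by timestamp.
print(ordered_cumulative_sum([(3, 10), (1, 5), (2, 1)]))
```

A function of this shape would be applied once per window, after the grouping step has collected the window's elements.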

Re: Avoid using docker when I use a external transformation

2023-04-18 Thread Robert Bradshaw via user
Docker is not necessary to expand the transform (indeed, by default it should just pull the Jar and invoke that directly to start the expansion service), but it is used as the environment in which to execute the expanded transform. It would be in theory possible to run the worker without docker

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-18 Thread Robert Bradshaw via user
n Mon, Apr 17, 2023 at 8:08 AM Reuven Lax wrote: >>>>> >>>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike >>>>>> Spark and Flink - dynamically modifies the parallelism as the operator >>>>>> runs, so there is no need to h

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Robert Bradshaw via user
What are you trying to achieve by setting the parallelism? On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang wrote: > Thanks Reuven, what I mean is to set the parallelism in operator level. > And the input size of the operator is unknown at compiling stage if it is > not a source > operator, > >

Re: Message guarantees

2023-04-14 Thread Robert Bradshaw via user
That is correct. On Tue, Apr 11, 2023 at 5:44 AM Hans Hartmann wrote: > > Hello, > > i'm wondering if Apache Beam is using the message guarantees of the > execution engines, that the pipeline is running on. > > So if i use the SparkRunner the consistency guarantees are exactly-once? > > Have a

Re: Why is FlatMap different from composing Flatten and Map?

2023-03-15 Thread Robert Bradshaw via user
On Mon, Mar 13, 2023 at 11:33 AM Godefroy Clair wrote: > Hi, > I am wondering about the way `Flatten()` and `FlatMap()` are implemented > in Apache Beam Python. > In most functional languages, FlatMap() is the same as composing > `Flatten()` and `Map()` as indicated by the name, so Flatten() and

Re: Deduplicate usage

2023-03-02 Thread Robert Bradshaw via user
Whenever state is used, the runner will arrange such that the same keys will all go to the same worker, which often involves injecting a shuffle-like operation if the keys are spread out among many workers in the input. (An alternative implementation could involve storing the state in a

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-07 Thread Robert Bradshaw via user
Seems reasonable to me. On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user wrote: > > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have stopped > being built and supported since July 2022. I have filed [2] to track the > resolution of this issue. > > Based upon [1], almost

Re: How to submit beam python pipeline to GKE flink cluster

2023-02-03 Thread Robert Bradshaw via user
You should be able to omit the environment_type and environment_config variables and they will be populated automatically. For running locally, the flink_master parameter is not needed either (one will be started up automatically). On Fri, Feb 3, 2023 at 12:51 PM Talat Uyarer via user wrote: > >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
I'm also not sure it's part of the contract that the containerization technology we use will always have these capabilities. On Mon, Jan 30, 2023 at 10:53 AM Chad Dombrova wrote: > > Hi Valentyn, > >> >> Beam SDK docker containers on Dataflow VMs are currently launched in >> privileged mode. >

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
Different idea: is it possible to serve this data via another protocol (e.g. sftp) rather than requiring a mount? On Mon, Jan 30, 2023 at 9:26 AM Chad Dombrova wrote: > > Hi Robert, > I know very little about the FileSystem classes, but I don’t think it’s > possible for a process running in

Re: Dataflow and mounting large data sets

2023-01-30 Thread Robert Bradshaw via user
If it's your input/output data, presumably you could implement a https://beam.apache.org/releases/javadoc/2.3.0/org/apache/beam/sdk/io/FileSystem.html for nfs. (I don't know what all that would entail...) On Mon, Jan 30, 2023 at 9:04 AM Chad Dombrova wrote: > > Hi Israel, > Thanks for

Re: [Python] Heterogeneous TaggedOutput Type Hints

2021-09-09 Thread Robert Bradshaw
> It seemed to run fine both on DirectRunner and PortableRunner (embed mode), > but Dataflow v2 runner raised an error at runtime seemingly associated with > the Shuffle service? I have job IDs and trace links if those are helpful as > well. > > Thanks, > Evan > >

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
Awesome, thanks! On Mon, Jun 14, 2021 at 5:36 PM Evan Galpin wrote: > > I’ll try to create something as small as possible from the pipeline I > mentioned. I should have time this week to do so. > > Thanks, > Evan > > On Mon, Jun 14, 2021 at 18:09 Robert Bradshaw wro

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
o sink > > It doesn't seem to matter if there are 0 messages in a subscription or 50k > messages at startup. The rate of new messages however is very low. Not sure > if those are helpful details, let me know if there's anything else specific > which would help. > > On Mon, Jun 14

Re: GroupIntoShards not sending bytes further when dealing with huge amount of data

2021-06-14 Thread Robert Bradshaw
+1, we'd really like to get to the bottom of this, so clear instructions on a pipeline/conditions that can reproduce it would be great. On Mon, Jun 14, 2021 at 7:34 AM Jan Lukavský wrote: > > Hi Eddy, > > you are probably hitting a not-yet discovered bug in SDF implementation in > FlinkRunner

Re: Issues running Kafka streaming pipeline in Python

2021-06-04 Thread Robert Bradshaw
Glad you were able to figure it out. Maybe it's moot with runner v2 becoming the default, but we really should give a clearer error in this case. On Wed, Jun 2, 2021 at 8:16 PM Chamikara Jayalath wrote: > > Great :) > > On Wed, Jun 2, 2021 at 8:15 PM Alex Koay wrote: >> >> Finally figured out

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-06-02 Thread Robert Bradshaw
On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez wrote: > > On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: >> >> If you want to control the total number of elements being processed >> across all workers at a time, you can do this by assigning random keys >
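
The quoted suggestion is cut off after "assigning random keys", so this sketch covers only that first step: pairing each element with a key drawn from a small, fixed key space. The limit of 10 and the helper name are hypothetical, and what follows the keying (presumably some per-key step that bounds concurrency) is not shown in the snippet and is not reproduced here.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")

MAX_KEYS = 10  # hypothetical size of the key space


def with_random_key(
    elements: Iterable[T],
    num_keys: int = MAX_KEYS,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[int, T]]:
    """Pair each element with a key drawn uniformly from range(num_keys)."""
    rng = rng or random.Random()
    for element in elements:
        yield rng.randrange(num_keys), element


# A seeded generator keeps the illustration reproducible.
keyed = list(with_random_key(range(1000), rng=random.Random(0)))
```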

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-06-02 Thread Robert Bradshaw
pipelines (google > does not allow more than 25 dataflow pipelines per region) with 10 elements > each, I am launching the next 20 pipelines. > > This is of course missing the benefit of serverless. > > Any idea, how to work around this? > > Best, > Eila > >

Re: Is there a way (seetings) to limit the number of element per worker machine

2021-05-17 Thread Robert Bradshaw
Note that workers generally process one element per thread at a time. The number of threads defaults to the number of cores of the VM that you're using. On Mon, May 17, 2021 at 10:18 AM Brian Hulette wrote: > What type of files are you reading? If they can be split and read by > multiple

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
re sorted by "key2". The >>> downstreaming process, for example, will make a rolling window with size N >>> that reads N records together at one time. But note, the rolling window >>> will not cross different "key1". >>> >>> So that is

Re: Question on printing out a PCollection

2021-04-30 Thread Robert Bradshaw
Sorry, no Java versions of this stuff (though it may be possible to use cross-language to invoke your Java pipeline from Python and get the benefits that way). On Fri, Apr 30, 2021 at 11:30 AM Tao Li wrote: > > Thanks @Ning Kang. > > @Robert Bradshaw I assume you are referring

Re: Question on printing out a PCollection

2021-04-30 Thread Robert Bradshaw
You can also use interactive Beam's collect, to get the PCollection as a Dataframe, and then print it or do whatever else with it as you like. On Fri, Apr 30, 2021 at 10:24 AM Ning Kang wrote: > > Hi Tao, > > The `show()` API works with any IPython notebook runtimes, including Colab, > Jupyter

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Robert Bradshaw
to distribute my dataset to >>>>> nodes, sort each partition by some key and then store each partition to >>>>> its >>>>> own file. >>>>> >>>>> Wenbing >>>>> >>>>> On Fri, Apr 2, 2021 at 9:

Re: Beam Dataframe - sort and grouping

2021-04-02 Thread Robert Bradshaw
Thanks for trying this out. Better support for groupby (e.g. https://github.com/apache/beam/pull/13843 , https://github.com/apache/beam/pull/13637) will be available in the next Beam release (2.29, in progress, but you could try out head if you want). Note, however, that Beam PCollections are by

Re: [Question] Need to write a pipeline in Go consuming events from Kafka

2021-03-29 Thread Robert Bradshaw
On Wed, Mar 24, 2021 at 4:24 AM Đức Trần Tiến wrote: > > And the last question: Could I write that pipeline in Java and invoke that > pipeline from Go? :D > That is exactly the story we're trying to pursue for getting the large set of Java connectors available to Go:

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
>> Am I reading this wrong? >> >> Kenn >> >> On Wed, Mar 24, 2021 at 4:35 PM Alex Amato wrote: >> >>> How about a PCollection containing every element which was successfully >>> written? >>> Basically the same things which were passed into it. >

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
nd just add new > functionality. Though, we need to follow the same pattern for user API and > maybe even naming for this feature across different IOs (like we have for > "readAll()" methods). > > > > I agree that we have to avoid returning PDone for such cases. > >

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Robert Bradshaw
Returning PDone is an anti-pattern that should be avoided, but changing it now would be backwards incompatible. PRs to add non-PDone returning variants (probably as another option to the builders) that compose well with Wait, etc. would be welcome. On Wed, Mar 24, 2021 at 11:14 AM Alexey

Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-12 Thread Robert Bradshaw
Do we now support 1.8 through 1.12? Unless there are specific objections, makes sense to me. On Fri, Mar 12, 2021 at 8:29 AM Alexey Romanenko wrote: > +1 too but are there any potential objections for this? > > On 12 Mar 2021, at 11:21, David Morávek wrote: > > +1 > > D. > > On Thu, Mar 11,

Re: Overwrite support from ParquetIO

2021-01-27 Thread Robert Bradshaw
Fortunately making deleting files idempotent is much easier than writing them :). But one needs to handle the case of concurrent execution as well as sequential re-execution due to possible zombie workers. On Wed, Jan 27, 2021 at 5:04 PM Reuven Lax wrote: > Keep in mind that DoFns might be
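
A minimal sketch of an idempotent delete on a local filesystem, as one way to read the remark above. It is an assumption-laden illustration, not code from the thread: the helper name is made up, and a real pipeline would go through whatever filesystem API it already uses. The delete is attempted directly and a missing file is treated as success, which covers both a sequential retry and a concurrent worker that got there first.

```python
import os
import tempfile


def delete_if_exists(path: str) -> bool:
    """Delete path; return True if this call removed it, False if already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # A retry or a concurrent worker already deleted it; nothing left to do.
        return False


# Usage: the second call is a harmless no-op.
fd, scratch_path = tempfile.mkstemp()
os.close(fd)
first = delete_if_exists(scratch_path)
second = delete_if_exists(scratch_path)
```

Attempting the delete and catching the error avoids the race that a separate "check if it exists, then delete" sequence would have between the check and the removal.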

Re: Is there an array explode function/transform?

2021-01-14 Thread Robert Bradshaw
all the different array > elements? > > On Thu, Jan 14, 2021 at 11:25 AM Robert Bradshaw > wrote: > >> I think it makes sense to allow specifying more than one, if desired. >> This is equivalent to just stacking multiple Unnests. (Possibly one could >> even have a s

Re: Is there an array explode function/transform?

2021-01-14 Thread Robert Bradshaw
ly could be a top-level transform. Should it automatically >>>>> unnest all arrays, or just the fields specified? >>>>> >>>>> We do have to define the semantics for nested arrays as well. >>>>> >>>>> On Wed, Jan 13, 2021 at 1

Re: Is there an array explode function/transform?

2021-01-13 Thread Robert Bradshaw
Ah, thanks for the clarification. UNNEST does sound like what you want here, and would likely make sense as a top-level relational transform as well as being supported by SQL. On Wed, Jan 13, 2021 at 10:53 AM Tao Li wrote: > @Kyle Weaver sure thing! So the input/output > definition for the

Re: Help measuring upcoming performance increase in flink runner on production systems

2020-12-21 Thread Robert Bradshaw
I agree. Borrowing the mutation detection from the direct runner as an intermediate point sounds like a good idea. On Mon, Dec 21, 2020 at 8:57 AM Kenneth Knowles wrote: > I really think we should make a plan to make this the default. If you test > with the DirectRunner it will do mutation

Re: is apache beam go sdk supported by spark runner?

2020-11-25 Thread Robert Bradshaw
Yes, it should be for batch (just like for Python). There is ongoing work to make it work for Streaming as well. On Sat, Nov 21, 2020 at 2:57 PM Meriem Sara wrote: > > Hello everyone. I am trying to use apache beam with Golang to execute a data > processing workflow using apache Spark.

Re: Support for Flink 1.11

2020-10-16 Thread Robert Bradshaw
Support for Flink 1.11 is https://issues.apache.org/jira/browse/BEAM-10612 . It has been implemented and will be included in the next release (Beam 2.25). In the meantime, you could try building yourself from head. On Fri, Oct 16, 2020 at 4:39 AM Kishor Joshi wrote: > > Hi Team, > > Since the

Re: Upload third party runtime dependencies for expanding transform like KafkaIO.Read in Python Portable Runner

2020-10-02 Thread Robert Bradshaw
at > org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157) > > On Fri, Oct 2, 2020 at 5:14 PM Robert Bradshaw wrote: >> >> Could you clarify a bit exactly what you're trying to do? When using

Re: [DISCUSS] Deprecation of AWS SDK v2 IO connectors

2020-09-15 Thread Robert Bradshaw
should > deprecate a v1 IO ONLY when we have full feature parity in the v2 version. > I think we don't have a replacement for AWSv1 S3 IO so that one should not > be > deprecated. > > On Tue, Sep 15, 2020 at 6:07 PM Robert Bradshaw > wrote: > > > > The 10x-100x

Re: Info needed - pmc mailing list

2020-08-25 Thread Robert Bradshaw
Try priv...@beam.apache.org. On Tue, Aug 25, 2020 at 6:18 AM D, Anup (Nokia - IN/Bangalore) wrote: > > Hi, > > > > We would like to know if there is a way to reach out to members of the pmc > group. > > We tried sending email to p...@beam.apache.org but it got bounced. > > > > Thanks > > Anup

Re: Out-of-orderness of window results when testing stateful operators with TextIO

2020-08-24 Thread Robert Bradshaw
As for the question of writing tests in the face of non-determinism, you should look into TestStream. MyStatefulDoFn still needs to be updated to not assume an ordering. (This can be done by setting timers that provide guarantees that (modulo late data) one has seen all data up to a certain

Re: Resource Consumption increase With TupleTag

2020-08-19 Thread Robert Bradshaw
Is this 2kps coming out of Filter1 + 2kps coming out of Filter2 (which would be 4kps total), or only 2kps coming out of KafkaIO and MessageExtractor? Though it /shouldn't/ matter, due to sibling fusion, there's a chance things are getting fused poorly and you could write Filter1 and Filter2

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-17 Thread Robert Bradshaw
I checked Java, it looks like the way things are structured we do not have that bug there. On Mon, Aug 17, 2020 at 3:31 PM Robert Bradshaw wrote: > > +1 > > Thanks, Eugene, for finding and fixing this! > > FWIW, most use of Python from the Python Portable Runner used the >

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-17 Thread Robert Bradshaw
+1 Thanks, Eugene, for finding and fixing this! FWIW, most use of Python from the Python Portable Runner used the embedded environment (this is the default direct runner), so dependencies are already present. On Mon, Aug 17, 2020 at 3:19 PM Daniel Oliveira wrote: > > Normally I'd say not to

Re: Scio 0.9.3 released

2020-08-05 Thread Robert Bradshaw
Thanks for the update! On Wed, Aug 5, 2020 at 11:46 AM Neville Li wrote: > > Hi all, > > We just released Scio 0.9.3. This bumps Beam SDK to 2.23.0 and includes a lot > of improvements & bug fixes. > > Cheers, > Neville > > https://github.com/spotify/scio/releases/tag/v0.9.3 > > "Petrificus

Re: ReadFromKafka returns error - RuntimeError: cannot encode a null byte[]

2020-07-22 Thread Robert Bradshaw
On Sat, Jul 18, 2020 at 12:08 PM Chamikara Jayalath wrote: > > > On Fri, Jul 17, 2020 at 10:04 PM ayush sharma <1705ay...@gmail.com> wrote: > >> Thank you guys for the reply. I am really stuck and could not proceed >> further. >> Yes, the previous trial published message had null key. >> But

Re: Testing Apache Beam pipelines / python SDK

2020-07-21 Thread Robert Bradshaw
t is the best way to test writing to BigQuery? > I have seen this file > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/bigtableio_it_test.py > but it appears it writes to real big query? > > kind regards > Marco > > > > >

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Robert Bradshaw
Collection as result of the pipeline, and i > would be testing the content of the PCollection > > Running this results in this message > > IT is skipped because --test-pipeline-options is not specified > > Would you be able to advise on this? > > kind regards > > Marco

Re: Testing Apache Beam pipelines / python SDK

2020-07-13 Thread Robert Bradshaw
You can use apache_beam.testing.util.assert_that to write tests of Beam pipelines. This is what Beam uses for its tests, e.g. https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util_test.py#L80 On Mon, Jul 13, 2020 at 2:36 PM Sofia’s World wrote: > > Hi all > i was

Re: Not able to see WordCount output in docker /tmp/...

2020-07-07 Thread Robert Bradshaw
Does it work when you write to a distributed filesystem? (One issue with Docker is that the manager and each of its workers have their own local filesystem.) On Tue, Jul 7, 2020 at 2:17 PM Avijit Saha wrote: > > While trying to run the Beam WordCount example on Flink runner using Job >

Re: Understanding combiner's distribution logic

2020-07-01 Thread Robert Bradshaw
On Tue, Jun 30, 2020 at 3:32 PM Julien Phalip wrote: > Thanks Luke! > > One part I'm still a bit unclear about is how exactly the PreCombine stage > works. In particular, I'm wondering how it can perform the combination > before the GBK. Is it because it can already compute the combination on >
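One way to picture how a combination can start before the GroupByKey is the sketch below. It is plain Python, not Beam's implementation: the helper names pre_combine, shuffle and post_combine are hypothetical, and only the four MeanFn method names mirror Beam's CombineFn interface. The assumption is that each worker folds its own bundle into one accumulator per key, so only accumulators (not raw elements) cross the shuffle.

```python
from collections import defaultdict


class MeanFn:
    """Mean as an associative combine: the accumulator is (sum, count)."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        totals, counts = zip(*accs)
        return (sum(totals), sum(counts))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def pre_combine(fn, bundle):
    """Hypothetical pre-shuffle step: one accumulator per key per bundle."""
    accs = {}
    for key, value in bundle:
        accs[key] = fn.add_input(accs.get(key, fn.create_accumulator()), value)
    return list(accs.items())


def shuffle(partials):
    """Stand-in for the GroupByKey: gather the accumulators of each key."""
    grouped = defaultdict(list)
    for key, acc in partials:
        grouped[key].append(acc)
    return grouped


def post_combine(fn, grouped):
    """Merge the per-bundle accumulators and extract the final value."""
    return {
        key: fn.extract_output(fn.merge_accumulators(accs))
        for key, accs in grouped.items()
    }


# Two bundles, as if they were processed on two different workers.
bundles = [[("a", 1), ("a", 3), ("b", 10)], [("a", 5), ("b", 20)]]
fn = MeanFn()
partials = [p for bundle in bundles for p in pre_combine(fn, bundle)]
result = post_combine(fn, shuffle(partials))
print(result)
```

Five input elements become four accumulators before the shuffle here; with many values per key in each bundle the reduction is much larger, which is the point of combining early.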

Re: PaneInfo showing UNKOWN State

2020-05-26 Thread Robert Bradshaw
To clarify, PaneInfo is supported on the FnAPI local runner, but not on the bundle based one. Unfortunately, Streaming is not supported on the FnAPI one (yet), but work there is ongoing. On Tue, May 26, 2020 at 11:46 AM Pablo Estrada wrote: > Hi Jayadeep, > Unfortunately, it seems that PaneInfo

Re: Timers exception on "Job Drain" while using stateful beam processing in global window

2020-05-18 Thread Robert Bradshaw
Glad you were able to get this working; thanks for following up. On Mon, May 18, 2020 at 10:35 AM Mohil Khare wrote: > Hi, > On another note, I think I was unnecessarily complicating things by > applying a sliding window here and then an extra global window to remove > duplicates. I replaced

Re: Portable Runner performance optimisation

2020-05-15 Thread Robert Bradshaw
I don't think this is being worked on, but given that Java already supports the LOOPBACK environment (which is a special case of EXTERNAL) it would just be a matter of properly parsing the flags. On Fri, May 15, 2020 at 9:52 AM Alexey Romanenko wrote: > Thanks! It looks that this is exactly

Re: adding apt-get to setup.py fails passing apt-get commands

2020-05-04 Thread Robert Bradshaw
Printing the result out shouldn't matter, but as mentioned in the docs, Popen.communicate is not intended to be used when the amount of output is large. If you need to just run the process, I would recommend a simple subprocess.check_output(). On Mon, May 4, 2020 at 9:00 AM OrielResearch Eila
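A minimal sketch of the subprocess.check_output() suggestion, assuming the goal is only to run a command and fail loudly on a non-zero exit code. The run_command helper is hypothetical, and the command shown is a stand-in that invokes the current Python interpreter so the example runs anywhere; in a setup.py it would be the actual install command.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return its combined stdout/stderr as text.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    return subprocess.check_output(args, stderr=subprocess.STDOUT, text=True)


# Hypothetical stand-in command; replace it with the real one.
output = run_command([sys.executable, "-c", "print('installed')"])
print(output.strip())
```

Because check_output raises on failure, a broken install step stops the setup instead of being silently ignored.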

Re: Kafka IO: value of expansion_service

2020-04-28 Thread Robert Bradshaw
https://github.com/apache/beam/pull/11557 On Tue, Apr 28, 2020 at 9:28 AM Robert Bradshaw wrote: > Java dependencies are not yet fully propagated over the expansion service, > which might be what you're running into. I'm actually in the process of > putting together a PR to fix this;

Re: Kafka IO: value of expansion_service

2020-04-28 Thread Robert Bradshaw
Java dependencies are not yet fully propagated over the expansion service, which might be what you're running into. I'm actually in the process of putting together a PR to fix this; I'll let you know when it's ready. On Mon, Apr 27, 2020 at 9:14 AM Kyle Weaver wrote: > I'm not sure about the

Re: Stateful & Timely Call

2020-04-23 Thread Robert Bradshaw
I may have misinterpreted your email, I thought you didn't have a need for keys at all. If this is actually the case, you don't need a GroupByKey, just have your DoFn take Rows as input, and emit List<Row> as output. That is, it's a DoFn<Row, List<Row>>. You can buffer multiple Rows in an instance variable between
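The buffering idea can be sketched as follows. This is plain Python rather than a real Beam DoFn (the thread itself concerns the Java SDK): BatchingFn, run_bundle and batch_size are hypothetical names, and the start/process/finish lifecycle is an assumption modelled on how a runner drives a bundle of elements.

```python
class BatchingFn:
    """Plain-Python stand-in for a DoFn that turns rows into lists of rows."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self._buffer = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, row):
        # Buffer the row; emit a full batch once enough rows have arrived.
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            batch, self._buffer = self._buffer, []
            yield batch

    def finish_bundle(self):
        # Flush whatever is left so no rows are dropped at the end.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            yield batch


def run_bundle(fn, rows):
    """Hypothetical driver that mimics a runner processing one bundle."""
    fn.start_bundle()
    out = []
    for row in rows:
        out.extend(fn.process(row))
    out.extend(fn.finish_bundle())
    return out


batches = run_bundle(BatchingFn(batch_size=3), list(range(7)))
print(batches)
```

The final flush matters: without it, the last partial batch would stay in the instance variable and never be emitted.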
