pause step until another step completes

2017-09-05 Thread Jacob Marble
Good morning- Given a batch pipeline with 3 file inputs and 4 file outputs, is there a way to prevent the 4 TextIO.write() steps from starting until all of the TexIO.write() steps are ready? The idea here is to fail on any exceptions before persisting any output data, making cleanup easier. Than

DoFn setup/teardown sequence

2017-10-15 Thread Jacob Marble
(there might be documentation on this that I didn't find; if so a link is sufficient) Good evening, this is just a check on my understanding. It looks like an instance of a given DoFn goes through this lifecycle. Am I correct? - constructor - @Setup (once) - @StartBundle (zero to many times)

Re: DoFn setup/teardown sequence

2017-10-16 Thread Jacob Marble
DoFn, @StartBundle is >> called >> for a set of data (bundle), @ProcessElement is for each element in the >> bundle/collection, @FinishBundle at the end of the dataset (bundle), >> @Teardown is called when the DoFn is "removed". >> &g

How to window by quantity of data?

2017-10-17 Thread Jacob Marble
My first streaming pipeline is pretty simple, it just pipes a queue into files: - read JSON objects from PubsubIO - event time = processing time - 5 minute windows ( - write n files to GCS, (TextIO.withNumShards() not dynamic) When the pipeline gets behind (for example, when the pipeline is stopp

Re: How to window by quantity of data?

2017-10-17 Thread Jacob Marble
usecase. > > 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html > > On Tue, Oct 17, 2017 at 2:46 PM, Jacob Marble wrote: > >> My first streaming pipeline is pretty simple, it just pipes a queue into >> files: >> >> - read JSON objects from Pub

Re: How to window by quantity of data?

2017-10-18 Thread Jacob Marble
taining the code. > > Other users may be interested in your solution. > > On Tue, Oct 17, 2017 at 9:05 PM, Jacob Marble wrote: > >> Lukasz- >> >> That worked. I created a stateful DoFn with a stale timer, an initial >> timestamp state, and a counter state, alon

Re: How to window by quantity of data?

2017-10-18 Thread Jacob Marble
if processing a bundle fails, any state changes are > discarded and the state is reset to what it was before the bundle was > processed. > > On Wed, Oct 18, 2017 at 12:15 PM, Jacob Marble > wrote: > >> Here's a gist: https://gist.github.com/jacobm >> arble/6ca40e0

Re: How to window by quantity of data?

2017-10-18 Thread Jacob Marble
That gist isn't working right now, but I'll update it when I find the bug. The direct runner grows memory, but never writes files. The dataflow runner writes temp files, but FinalizeGroupByKey never moves them to the final destination. Jacob On Wed, Oct 18, 2017 at 12:55 PM, Jacob Mar

Re: How to window by quantity of data?

2017-10-18 Thread Jacob Marble
Thomas, I reworked using the GroupIntoBatches PTransform, and things working great (with fewer lines of code). Thanks Jacob On Wed, Oct 18, 2017 at 1:12 PM, Jacob Marble wrote: > That gist isn't working right now, but I'll update it when I find the bug. > > The direct runne

Is anyone using Beam for geo use cases?

2017-10-19 Thread Jacob Marble
Is anyone using Beam to solve geo problems? For example, a simple "reverse geo" function: f(lat, lon) => country, state/province/etc, city, postal code Jacob

Re: Is anyone using Beam for geo use cases?

2017-10-20 Thread Jacob Marble
n 20 Oct 2017 12:40, "Csaba Kassai" wrote: > >> Hi Jacob, >> >> we are doing the opposite direction: we enrich data with geo coordinates >> from textual address using Google Maps API with Cloud Dataflow. >> Are you interested in this use-case? >>

Re: How to window by quantity of data?

2017-10-20 Thread Jacob Marble
Final, working version is in the original gist: https://gist.github.com/jacobmarble/6ca40e0a14828e6a0dfe89b9cb2e4b4c The result is heavily inspired by GroupIntoBatches, and doesn't use windowing. Jacob On Wed, Oct 18, 2017 at 2:49 PM, Jacob Marble wrote: > Thomas, I reworked u

"processing lull"

2017-10-29 Thread Jacob Marble
Good evening- What should I make of the log warning "processing lull for [instant] in state windmill-read" ? - This happens in a streaming pipeline, in Dataflow. - The DoFn that emits the log entry makes HTTP requests to a third-party. - This only happens when I added a side input to the PTransfo

Re: "processing lull"

2017-10-29 Thread Jacob Marble
a map operation is taking too long Jacob On Sun, Oct 29, 2017 at 9:15 PM, Jacob Marble wrote: > Good evening- > > What should I make of the log warning "processing lull for [instant] in > state windmill-read" ? > > - This happens in a streaming pipeline, in Dataflow.

Re: "processing lull"

2017-10-29 Thread Jacob Marble
- Upstream steps seem to be slowed down by this PTransform (system lag up and elements/sec down) - Unbounded source is PubSubIO Jacob On Sun, Oct 29, 2017 at 9:22 PM, Jacob Marble wrote: > - This does not happen when I don't use the reshuffle hack > - HTTP QPS seems to be im

DoFn.OnTimer thread safe?

2017-10-30 Thread Jacob Marble
In a DoFn instance, is an OnTimer method always called when no ProcessElement method is active? Jacob

Windowing in a batch pipeline

2017-11-08 Thread Jacob Marble
Good evening. I'm trying to nail down windowing. The concept is clear, just struggling with writing a working pipeline. Tonight the goal is group events by key and window, in a batch pipeline. All data is "late" because it's a batch pipeline, and I expect nothing to be dropped or processed in a "la

Re: Windowing in a batch pipeline

2017-11-09 Thread Jacob Marble
d, Nov 8, 2017 at 5:33 PM, Jacob Marble wrote: > > Good evening. I'm trying to nail down windowing. The concept is clear, > just > > struggling with writing a working pipeline. Tonight the goal is group > events > > by key and window, in a batch pipeline. All data is &qu

@DoFn.Setup not called

2017-11-16 Thread Jacob Marble
This one is weird. A DoFn I wrote: - stateful - used plenty in a streaming pipeline - direct and dataflow runners - works fine Now: - new batch pipeline - @DoFn.Setup method not called - direct runner works properly (logs from setup method are output) - dataflow runner simply doesn't call the set

Re: [VOTE] Choose the "new" Spark runner

2017-11-16 Thread Jacob Marble
[ ] Use Spark 1 & Spark 2 Support Branch [X] Use Spark 2 Only Branch Spark 2 has been out for a while, so probably not going to offend many people. Jacob On Thu, Nov 16, 2017 at 5:45 AM, Neville Dipale wrote: > [ ] Use Spark 1 & Spark 2 Support Branch > [X] Use Spark 2 Only Branch > >

Re: Slack Channel

2017-11-16 Thread Jacob Marble
Me too, if you don't mind. Jacob On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik wrote: > Invite sent, welcome. > > On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang wrote: > >> Hi, >> >> Please add me to the slack channel. >> >> Thanks, >> Fred >> >> Ps. I think "BeamTV" would be a great YouTube channel

Re: @DoFn.Setup not called

2017-11-17 Thread Jacob Marble
t;> ​I've been using DoFn.Setup method in Dataflow and it seems to be working >> fine.​ >> >> On Thu, Nov 16, 2017 at 4:56 PM, Jacob Marble >> wrote: >> >>> This one is weird. >>> >>> A DoFn I wrote: >>> - stateful >>> - u

Re: @DoFn.Setup not called

2017-11-17 Thread Jacob Marble
lly use uninitialized things. Downside: it might mask > repeated initialization and only manifest as poor performance. > > Kenn > > On Fri, Nov 17, 2017 at 9:00 AM, Jacob Marble wrote: > >> I tried to write a simpler DoFn that induces the error, but it works >> fine. Workin

Re: @DoFn.Setup not called

2017-11-17 Thread Jacob Marble
ng doesn't reduce the worker quantity when step 1 completes. Since step 2 doesn't speed up with more workers, it would be best if it could start as soon as step 1 starts. This way, the job completes faster and uses fewer resources. Jacob On Fri, Nov 17, 2017 at 12:23 PM, Jacob Marble

Re: @DoFn.Setup not called

2017-11-17 Thread Jacob Marble
I also notice that stateful DoFn's seem to only be instantiated once in Dataflow, but multiple instances do end up being created in the direct runner. Is there a story behind that? Jacob On Fri, Nov 17, 2017 at 7:22 PM, Jacob Marble wrote: > Noticing some related and unexpected dif

unique DoFn id

2017-11-19 Thread Jacob Marble
Is there a recommended way to get a unique id for each instance of a DoFn? - DataflowWorkerHarnessOptions.getWorkerId() only returns a unique id per worker, which can contain multiple instances of a DoFn. - Looks like ThreadLocalRandom is seeded with the same value on every instance - Thinking I'l

Re: unique DoFn id

2017-11-19 Thread Jacob Marble
> readObject()? > > On Sun, Nov 19, 2017 at 8:17 AM Jacob Marble wrote: > >> Is there a recommended way to get a unique id for each instance of a >> DoFn? >> >> - DataflowWorkerHarnessOptions.getWorkerId() only returns a unique id >> per worker, which can con

Re: unique DoFn id

2017-11-19 Thread Jacob Marble
he constructor - it directly > creates an instance of the class (even if it doesn't declare a default > constructor) and repopulates fields. > > On Sun, Nov 19, 2017, 7:07 PM Jacob Marble wrote: > >> Eugene, that worked. Can you explain why this doesn't work when I set

Re: @DoFn.Setup not called

2017-11-21 Thread Jacob Marble
Nov 17, 2017 at 12:23 PM, Jacob Marble > wrote: > >> Here is a small pipeline job that fails using the Dataflow runner, but >> doesn't fail using the direct runner. >> >> https://gist.github.com/jacobmarble/804c2edb9c80a2863f3e671d6851a55f >> >>

Dataflow fusion

2017-12-07 Thread Jacob Marble
Is there a long-term plan for preventing fusion in Dataflow pipelines? Maybe a simple flag --disableFusion ? I have read a few of the discussions in the Beam mailing lists, and I haven't found any sentiment that something should be changed about Dataflow, only that Dataflow users should work aroun

Re: Dataflow fusion

2017-12-07 Thread Jacob Marble
the Beam SDK to > turn a PCollection into a PCollection> and then writing a > DoFn that processes batches, amortizing communication costs across > elements.) > > - Robert > > > > > On Thu, Dec 7, 2017 at 12:33 PM, Jacob Marble wrote: > > Is there a lo

Troubleshooting live Java process in Dataflow

2017-12-07 Thread Jacob Marble
In this post, Reuven says: "we had to ssh to the VMs to get actual thread profiles from workers. After a bit of digging, we found threads were often stuck in the following stack trace" Can someone describe what tools you use to do this? I logged into a Dataflow runner, found that the gc is thrash

Re: Dataflow fusion

2017-12-12 Thread Jacob Marble
Sorry, can you tell me about the implicit PubSub dedupe? Jacob On Tue, Dec 12, 2017 at 5:38 PM, Jacob Marble wrote: > This is a very simple streaming job. GBK would be overkill. > > Jacob > > On Fri, Dec 8, 2017 at 10:58 AM, Raghu Angadi wrote: > >> Jacob, >>

Re: Dataflow fusion

2017-12-12 Thread Jacob Marble
izing waste. Is this a streaming job? There is already an implicit > shuffle for pubsub dedup. > > On Thu, Dec 7, 2017 at 2:24 PM, Jacob Marble wrote: > >> re asynchronicity, I agree that simply having 1000 threads blocking on >> IO is also not ideal. >> >> Be

Re: Troubleshooting live Java process in Dataflow

2017-12-12 Thread Jacob Marble
That's perfect, should have known. Thanks Jacob On Fri, Dec 8, 2017 at 7:17 AM, Marián Dvorský wrote: > The worker exports an HTTP server with a handler for "/threadz": > > $ curl localhost:8081/threadz > > gives the stack traces for all Java threads. > > O

Re: Dependencies and Datastore

2018-01-30 Thread Jacob Marble
Josh, what did you do to work around this? This suddenly crept up on a production pipeline yesterday, without anything changing on our side (we do rebuild at every run). Jacob On Fri, Dec 8, 2017 at 6:46 PM, Chamikara Jayalath wrote: > Created https://issues.apache.org/jira/browse/BEAM-3321 to

Re: Dependencies and Datastore

2018-01-30 Thread Jacob Marble
Also weird, things work fine (no need to force gax to any version other than 1.3.1) when I build and run from my workstation, but fail on our CI/CD worker. Jacob On Tue, Jan 30, 2018 at 4:55 PM, Jacob Marble wrote: > Josh, what did you do to work around this? > > This suddenly crep

Re: Dependencies and Datastore

2018-01-31 Thread Jacob Marble
+- com.google.api:gax-grpc:jar:0.20.0:compile > > > Here are the parts of my pom that involve gax and Beam > > > > org.apache.beam > beam-runners-google-cloud-dataflow-java > 2.2.0 > > > > > com.google.api > gax > 1.15.0 > > > > &

working with hot keys

2018-02-12 Thread Jacob Marble
When joining (Join.leftOuterJoin etc) a PCollection to PCollection, and K:V1 contains hot keys, my pipeline gets very slow. It can bring processing time from hours to days. Reading this blog post I

Re: working with hot keys

2018-02-13 Thread Jacob Marble
matter. > I'll try this. K:V2 has unique keys and doesn't change a lot from day-to-day, so I'll make that a side input. Should I expect this method to perform significantly slower than Join.someJoin/CoGBK? On Mon, Feb 12, 2018 at 1:53 PM, Jacob Marble wrote: > >> When j

Re: working with hot keys

2018-02-13 Thread Jacob Marble
ending on whether your running > a pipeline with bounded or unbounded PCollections, PCollection sizes, side > input access pattern. > > On Tue, Feb 13, 2018 at 9:08 AM, Jacob Marble wrote: > >> On Mon, Feb 12, 2018 at 3:59 PM, Lukasz Cwik wrote: >> >>>