KafkaIO error

2019-01-22 Thread Vilhelm von Ehrenheim
job. What I am wondering is essentially if this means that I am loosing data or if this will be retried by the sink? Also if this means losing the record what is the best way to configure the KafkaIO sink to be less aggressive? I am still Beam 2.8 in this pipeline. Regards, Vilhelm von Ehrenheim

Re: Multiple firings on side input

2018-04-16 Thread Vilhelm von Ehrenheim
; particular key and find the last. It is not ideal, either in clarity or > performance, but it can work for some cases until we have retraction > support. > > Apologies for typos or broken code here, as I am just typing it in email > without checking its compilation or behavio

Multiple firings on side input

2018-04-16 Thread Vilhelm von Ehrenheim
from the latest trigger firing. This is particularly useful if you use a side input with a single global window and specify a trigger. Thanks! // Vilhelm von Ehrenheim ​

Count on non-global windows

2018-02-26 Thread Vilhelm von Ehrenheim
for the `Count.globally()` transform. I can implement this using my own CompbineFn or `Sum` transform instead of course but still thought it was a bit strange. Is this a bug or why is this happening? Regards, Vilhelm von Ehrenheim

Re: TextIO.watchForNewFiles and Watermarks

2018-02-22 Thread Vilhelm von Ehrenheim
ses FileSystems.match() to list > the files and computes file timestamps and watermark in the way appropriate > to your use case. > > On Wed, Feb 21, 2018 at 7:29 AM Vilhelm von Ehrenheim < > vonehrenh...@gmail.com> wrote: > >> Hi! >> I have some problems with w

Re: Missing watermark?

2018-02-20 Thread Vilhelm von Ehrenheim
in KafkaIO are being updated in > https://github.com/apache/beam/pull/4680. > - TextIO.watchForNewFiles() - I am not sure how the watermark is handled > by TextIO. Didn't notice any mentions of in implementation. > > On Tue, Feb 20, 2018 at 10:13 AM, Vilhelm von Ehrenheim < >

Re: Global sum of latest help

2017-12-05 Thread Vilhelm von Ehrenheim
PM, Vilhelm von Ehrenheim < vonehrenh...@gmail.com> wrote: > No the order is not so important as long as it is correct and doesnt emit > sums for late values. > > {"id": "2", "parent_id": "a", "timestamp": 2, "amount"

Re: Global sum of latest help

2017-12-05 Thread Vilhelm von Ehrenheim
gt;> {"id": "1", parent_id: "a", "timestamp": 3, "amount": 2} >>> ``` >>> would the output be 3 and then 5 or would you still want 1, 4, and then >>> 5? >>> >> >> My own guess here would

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Vilhelm von Ehrenheim
I'm super excited about this release! Great work everyone involved! On Mon, Dec 4, 2017 at 10:58 AM, Jean-Baptiste Onofré wrote: > Just an important note that we forgot to mention. > > !! The 2.2.0 release will be the last one supporting Spark 1.x and Java 7 > !! > > Starting

Re: Is anyone using Beam for geo use cases?

2017-10-20 Thread Vilhelm von Ehrenheim
I am interested! I need to do this soon but havent started yet. Would love to see an example of how you do it. :) Br, Vilhelm von Ehrenheim On 20 Oct 2017 12:40, "Csaba Kassai" <csaba.kas...@doctusoft.com> wrote: > Hi Jacob, > > we are doing the opposite direction

Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-17 Thread Vilhelm von Ehrenheim
+1 On Tue, Oct 17, 2017 at 8:22 PM, Thomas Groh wrote: > I'm pretty strongly in favor of phasing out Java7 support, especially > given that it was EoL'd more than two years ago. However, I'm not sure how > this interacts with the repository's backwards-compatibility guarantees

Re: Beam and Python API: Pandas/Numpy?

2017-09-29 Thread Vilhelm von Ehrenheim
Hi Steve! I have several pipelines that successfully use both numpy and scikit models without any problems. I don't think I use Pandas atm but I'm sure that is fine too. However, you might have to do some special stuff if you encounter serializabillity problems. I also have tensorflow models in

Best way to read files with timestamps in names

2017-09-21 Thread Vilhelm von Ehrenheim
interesting to know what the best approach would be. Thanks! Vilhelm von Ehrenheim

Re: Testing transforms with generics using PAssert.containsAnyOrder

2017-09-21 Thread Vilhelm von Ehrenheim
you so much and sorry for my confusion! Still a bit interesting that I didn't need to do this explicitly when running in Dataflow but it was needed in the Testing case. Br, Vilhelm On Wed, Sep 20, 2017 at 11:50 PM, Vilhelm von Ehrenheim < vonehrenh...@gmail.com> wrote: > I dont hav

Re: Testing transforms with generics using PAssert.containsAnyOrder

2017-09-20 Thread Vilhelm von Ehrenheim
fferent encodings or processing, it gets > harder to reason about the coder decoding the original inputs with full > fidelity - splitting these into multiple PCollections is preferable if they > need to be processed as their original type. > > On Wed, Sep 20, 2017 at 2:24 PM, Vilhelm v

Re: Testing transforms with generics using PAssert.containsAnyOrder

2017-09-20 Thread Vilhelm von Ehrenheim
; >> List records = … // create test records >> >> PCollection inputRecords = pipeline.apply(Create.of(records)); >> >> PCollection output = >> input.apply(ParDo.of(myTagByConsumptionTypeFunc)); >> >> PAssert.that(output).satisfies(new TestSatisfies&

Problem with autoscaling

2017-09-04 Thread Vilhelm von Ehrenheim
ice/dataflow-service-desc#autoscaling Does anyone know if this is true? I know this is not the forum for Dataflow questions in general but I though someone else here might have experience that support or contradict this. Thanks, Vilhelm von Ehrenheim

Re: Reading and writing to external services in DoFns

2017-07-29 Thread Vilhelm von Ehrenheim
as you add to the set of previous records. However, if the index of previous records can fit into memory on the nodes I would recommend to use a side input instead that you do the check against in the DoFn. That should both be fast and work well in streaming. Hope it helps. Br, Vilhelm von Ehrenheim

[Python] Stateful processing in Python SDK

2017-07-25 Thread Vilhelm von Ehrenheim
can loop over in a nice way? If I read the whole set I'll most likely run out of memory. I've found that there exist stateful processing in the Java SDK but it seems to be missing in python still. Any help/ideas are greatly appreciated. Thanks, Vilhelm von Ehrenheim

Re: Best way to load heavy object into memory on nodes (python sdk)

2017-05-25 Thread Vilhelm von Ehrenheim
; loads and caches the ML model from a side input or returns the singleton if >> it has been loaded. >> You'll want to use some form of locking to ensure that you really only >> load the ML model once. >> >> On Wed, May 24, 2017 at 6:18 AM, Vilhelm von Ehrenheim < >&

Re: Sample Data

2017-05-24 Thread Vilhelm von Ehrenheim
, Vilhelm von Ehrenheim ​ On Wed, May 24, 2017 at 8:37 AM, Prabeesh K. <prabsma...@gmail.com> wrote: > > How to we can take sample data in Python? >