Re: Joining PCollections to aggregates of themselves

2019-10-11 Thread Kenneth Knowles
This seems a great example of use of stateful DoFn. It has essentially the
same structure as the example on the Beam blog but is more meaningful.

Kenn

On Fri, Oct 11, 2019 at 12:38 PM Robert Bradshaw 
wrote:

> OK, the only way to do this would be via a non-deterministic stateful
> DoFn that buffers elements as they come in and computes averages by
> looking at the buffer each time.
>
> This could also be represented with an extension to window merging and
> a join, where the trigger would be explicitly used to control the
> balance between latency and correctness.
>
> On Fri, Oct 11, 2019 at 8:01 AM Sam Stephens 
> wrote:
> >
> > On 2019/10/10 18:23:46, Eugene Kirpichov  wrote:
> > > " input elements can pass through the Joiner DoFn before the sideInput
> > > corresponding to that element is present"
> > >
> > > I don't think this is correct. Runners will evaluate a DoFn with side
> > > inputs on elements in a given window only after all side inputs are
> ready
> > > (have triggered at least once) in this window, so your code should be
> safe.
> > > However, runners will not rerun the DoFn with side inputs on subsequent
> > > triggerings of the side inputs, so you won't be able to update the
> results.
> >
> > Yes, but the second or third time an element falling into a given window
> is processed by the Joiner DoFn, the side input may not be up to date with
> these new elements, so the side input having triggered at least once is not
> a guarantee that it is up to date.
> >
> > On 2019/10/10 18:35:21, Robert Bradshaw  wrote:
> >
> > > Time: 00:08:00
> > > Input: 
> > Output: 
> >
> > >
> > > Time: 00:13:00
> > > Input: 
> >
> > Output:  // average 4 & 6
> >
> > >
> > > Time: 00:00:00
> > > Input: 
> >
> > Output:  // average 1
> > >
> > > Time: 00:02:00
> > > Input: 
> >
> > Output:  // average 1 & 2
> >
> > I'd say the least surprising result here is that the aggregate includes
> the best available information at the time of processing. So yes, it is
> sensitive to the order of arrival; that's unavoidable, I think.
> >
> > >
> > > Are you really trying to emit elements with the mean of all elements
> > > with timestamp up to 10 minutes prior to the current value? That's a
> > > bit different than sliding windows. In that case you could do
> > > something with a Stateful DoFn that buffers elements and for each
> > > incoming element sets a timer at T which then reads the buffer,
> > > computes the output, and discards elements older than 10 minutes. You
> > > could also possibly do this with a custom WindowFn.
> > >
> >
> > Yes the requirement is basically to enrich an event stream with values
> computed over arbitrary other event streams (including the event stream
> being enriched) and to do this with as low latency as possible.
> >
> > Of course the values derived from other event streams might not be
> included even if they occur before the event being enriched (even if
> "before" is in both the event-time and processing-time sense). But this is
> easier to swallow because there's no obvious causal dependency between that
> aggregate value and the event being enriched.
> >
> > .. I hope that made sense
>


Re: Joining PCollections to aggregates of themselves

2019-10-11 Thread Robert Bradshaw
OK, the only way to do this would be via a non-deterministic stateful
DoFn that buffers elements as they come in and computes averages by
looking at the buffer each time.

This could also be represented with an extension to window merging and
a join, where the trigger would be explicitly used to control the
balance between latency and correctness.
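
A framework-free sketch of that buffering approach, with plain Python standing in for a stateful DoFn holding per-key BagState (the names here are illustrative, not Beam APIs):

```python
from collections import defaultdict

def join_with_running_average(events):
    """events: (key, value) pairs in arrival (processing-time) order."""
    buffers = defaultdict(list)  # stands in for per-key BagState
    for key, value in events:
        buffers[key].append(value)   # buffer the element as it comes in
        buf = buffers[key]
        # Recompute the average by looking at the buffer each time.
        yield key, value, sum(buf) / len(buf)

enriched = list(join_with_running_average([("k", 4), ("k", 6), ("k", 1)]))
```

The output is non-deterministic in exactly the sense above: each element is joined only with whatever happened to arrive before it.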

On Fri, Oct 11, 2019 at 8:01 AM Sam Stephens  wrote:
>
> On 2019/10/10 18:23:46, Eugene Kirpichov  wrote:
> > " input elements can pass through the Joiner DoFn before the sideInput
> > corresponding to that element is present"
> >
> > I don't think this is correct. Runners will evaluate a DoFn with side
> > inputs on elements in a given window only after all side inputs are ready
> > (have triggered at least once) in this window, so your code should be safe.
> > However, runners will not rerun the DoFn with side inputs on subsequent
> > triggerings of the side inputs, so you won't be able to update the results.
>
> Yes, but the second or third time an element falling into a given window is 
> processed by the Joiner DoFn, the side input may not be up to date with these 
> new elements, so the side input having triggered at least once is not a 
> guarantee that it is up to date.
>
> On 2019/10/10 18:35:21, Robert Bradshaw  wrote:
>
> > Time: 00:08:00
> > Input: 
> Output: 
>
> >
> > Time: 00:13:00
> > Input: 
>
> Output:  // average 4 & 6
>
> >
> > Time: 00:00:00
> > Input: 
>
> Output:  // average 1
> >
> > Time: 00:02:00
> > Input: 
>
> Output:  // average 1 & 2
>
> I'd say the least surprising result here is that the aggregate includes the 
> best available information at the time of processing. So yes, it is sensitive 
> to the order of arrival; that's unavoidable, I think.
>
> >
> > Are you really trying to emit elements with the mean of all elements
> > with timestamp up to 10 minutes prior to the current value? That's a
> > bit different than sliding windows. In that case you could do
> > something with a Stateful DoFn that buffers elements and for each
> > incoming element sets a timer at T which then reads the buffer,
> > computes the output, and discards elements older than 10 minutes. You
> > could also possibly do this with a custom WindowFn.
> >
>
> Yes the requirement is basically to enrich an event stream with values 
> computed over arbitrary other event streams (including the event stream being 
> enriched) and to do this with as low latency as possible.
>
> Of course the values derived from other event streams might not be included 
> even if they occur before the event being enriched (even if "before" is in 
> both the event-time and processing-time sense). But this is easier to swallow 
> because there's no obvious causal dependency between that aggregate value and 
> the event being enriched.
>
> .. I hope that made sense


Re: ETL with Beam?

2019-10-11 Thread Robert Bradshaw
These can be externalized as PTransforms. E.g. the generic ETL
pipeline could just be written

pipeline
.apply(SomeExtractPTransform())  // aka Source
.apply(SomeTransformPTransform())
.apply(SomeLoadPTransform())  // aka Sink

Any and all of these PTransforms may be composite (i.e. composed of
smaller transforms). But perhaps I'm not quite following what you're
trying to say.
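
To make that shape concrete, here is a minimal framework-free sketch in which each stage is just a composable callable, mirroring how (possibly composite) PTransforms chain via apply(). The stage names and record fields are made up for illustration:

```python
def extract():                       # aka Source
    return [{"id": 1, "raw": " Alice "}, {"id": 2, "raw": " Bob "}]

def transform(records):              # may itself be composed of smaller steps
    return [{**r, "name": r["raw"].strip()} for r in records]

def load(records):                   # aka Sink
    return {r["id"]: r["name"] for r in records}

result = load(transform(extract()))  # pipeline.apply(...).apply(...).apply(...)
```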

On Fri, Oct 11, 2019 at 11:11 AM Steve973  wrote:
>
> The real benefit of a good ETL framework is being able to externalize your 
> extraction and transformation mappings.  If I didn't have to write that part, 
> that would be really cool!
>
> On Fri, Oct 11, 2019 at 1:28 PM Robert Bradshaw  wrote:
>>
>> I would like to call out that Beam itself can be directly used for
>> ETL, no extra framework required (not to say that both of these
>> frameworks don't provide additional value, e.g. GUI-style construction
>> of pipelines).
>>
>>
>> On Fri, Oct 11, 2019 at 9:29 AM Ryan Skraba  wrote:
>> >
>> > Hello!  Talend has a big data ETL product in the cloud called Pipeline
>> > Designer, entirely powered by Beam.  There was a talk at Beam Summit
>> > 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
>> > the live demo wasn't captured in the video.  You can find other videos
>> > of Pipeline Designer online to see if it might fit your needs, and
>> > there is a free trial!  Depending on how your work project is
>> > oriented, it may be of interest.
>> >
>> > Best regards, Ryan
>> >
>> > On Fri, Oct 11, 2019 at 12:26 PM Steve973  wrote:
>> > >
>> > > Thank you for your reply.  I will check it out!  I'm in the evaluation 
>> > > phase, especially since I have some time before I have to implement all 
>> > > of this.
>> > >
>> > > On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:
>> > >>
>> > >> I'm not sure if this will help but kettle runs on beam too.
>> > >>
>> > >> https://github.com/mattcasters/kettle-beam
>> > >>
>> > >> https://youtu.be/vgpGrQJnqkM
>> > >>
>> > >> Depends on your use case but kettle rocks for etl.
>> > >>
>> > >> Dan
>> > >>
>> > >> Sent from my phone
>> > >>
>> > >> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
>> > >>>
>> > >>> Hello, all.  I still have not been given the tasking to convert my 
>> > >>> work project to use Beam, but it is still something that I am looking 
>> > >>> to do in the fairly near future.  Our data workflow consists of ingest 
>> > >>> and transformation, and I was hoping that there are ETL frameworks 
>> > >>> that work well with Beam.  Does anyone have some recommendations and 
>> > >>> maybe some samples that show how people might use an ETL framework 
>> > >>> with Beam?
>> > >>>
>> > >>> Thanks in advance and have a great day!


Re: ETL with Beam?

2019-10-11 Thread Steve973
The real benefit of a good ETL framework is being able to externalize your
extraction and transformation mappings.  If I didn't have to write that
part, that would be really cool!

On Fri, Oct 11, 2019 at 1:28 PM Robert Bradshaw  wrote:

> I would like to call out that Beam itself can be directly used for
> ETL, no extra framework required (not to say that both of these
> frameworks don't provide additional value, e.g. GUI-style construction
> of pipelines).
>
>
> On Fri, Oct 11, 2019 at 9:29 AM Ryan Skraba  wrote:
> >
> > Hello!  Talend has a big data ETL product in the cloud called Pipeline
> > Designer, entirely powered by Beam.  There was a talk at Beam Summit
> > 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
> > the live demo wasn't captured in the video.  You can find other videos
> > of Pipeline Designer online to see if it might fit your needs, and
> > there is a free trial!  Depending on how your work project is
> > oriented, it may be of interest.
> >
> > Best regards, Ryan
> >
> > On Fri, Oct 11, 2019 at 12:26 PM Steve973  wrote:
> > >
> > > Thank you for your reply.  I will check it out!  I'm in the evaluation
> phase, especially since I have some time before I have to implement all of
> this.
> > >
> > > On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:
> > >>
> > >> I'm not sure if this will help but kettle runs on beam too.
> > >>
> > >> https://github.com/mattcasters/kettle-beam
> > >>
> > >> https://youtu.be/vgpGrQJnqkM
> > >>
> > >> Depends on your use case but kettle rocks for etl.
> > >>
> > >> Dan
> > >>
> > >> Sent from my phone
> > >>
> > >> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
> > >>>
> > >>> Hello, all.  I still have not been given the tasking to convert my
> work project to use Beam, but it is still something that I am looking to do
> in the fairly near future.  Our data workflow consists of ingest and
> transformation, and I was hoping that there are ETL frameworks that work
> well with Beam.  Does anyone have some recommendations and maybe some
> samples that show how people might use an ETL framework with Beam?
> > >>>
> > >>> Thanks in advance and have a great day!
>


Re: ETL with Beam?

2019-10-11 Thread Robert Bradshaw
I would like to call out that Beam itself can be directly used for
ETL, no extra framework required (not to say that both of these
frameworks don't provide additional value, e.g. GUI-style construction
of pipelines).


On Fri, Oct 11, 2019 at 9:29 AM Ryan Skraba  wrote:
>
> Hello!  Talend has a big data ETL product in the cloud called Pipeline
> Designer, entirely powered by Beam.  There was a talk at Beam Summit
> 2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
> the live demo wasn't captured in the video.  You can find other videos
> of Pipeline Designer online to see if it might fit your needs, and
> there is a free trial!  Depending on how your work project is
> oriented, it may be of interest.
>
> Best regards, Ryan
>
> On Fri, Oct 11, 2019 at 12:26 PM Steve973  wrote:
> >
> > Thank you for your reply.  I will check it out!  I'm in the evaluation 
> > phase, especially since I have some time before I have to implement all of 
> > this.
> >
> > On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:
> >>
> >> I'm not sure if this will help but kettle runs on beam too.
> >>
> >> https://github.com/mattcasters/kettle-beam
> >>
> >> https://youtu.be/vgpGrQJnqkM
> >>
> >> Depends on your use case but kettle rocks for etl.
> >>
> >> Dan
> >>
> >> Sent from my phone
> >>
> >> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
> >>>
> >>> Hello, all.  I still have not been given the tasking to convert my work 
> >>> project to use Beam, but it is still something that I am looking to do in 
> >>> the fairly near future.  Our data workflow consists of ingest and 
> >>> transformation, and I was hoping that there are ETL frameworks that work 
> >>> well with Beam.  Does anyone have some recommendations and maybe some 
> >>> samples that show how people might use an ETL framework with Beam?
> >>>
> >>> Thanks in advance and have a great day!


Re: ETL with Beam?

2019-10-11 Thread Ryan Skraba
Hello!  Talend has a big data ETL product in the cloud called Pipeline
Designer, entirely powered by Beam.  There was a talk at Beam Summit
2018 (https://www.youtube.com/watch?v=1AlEGUtiQek), but unfortunately
the live demo wasn't captured in the video.  You can find other videos
of Pipeline Designer online to see if it might fit your needs, and
there is a free trial!  Depending on how your work project is
oriented, it may be of interest.

Best regards, Ryan

On Fri, Oct 11, 2019 at 12:26 PM Steve973  wrote:
>
> Thank you for your reply.  I will check it out!  I'm in the evaluation phase, 
> especially since I have some time before I have to implement all of this.
>
> On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:
>>
>> I'm not sure if this will help but kettle runs on beam too.
>>
>> https://github.com/mattcasters/kettle-beam
>>
>> https://youtu.be/vgpGrQJnqkM
>>
>> Depends on your use case but kettle rocks for etl.
>>
>> Dan
>>
>> Sent from my phone
>>
>> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
>>>
>>> Hello, all.  I still have not been given the tasking to convert my work 
>>> project to use Beam, but it is still something that I am looking to do in 
>>> the fairly near future.  Our data workflow consists of ingest and 
>>> transformation, and I was hoping that there are ETL frameworks that work 
>>> well with Beam.  Does anyone have some recommendations and maybe some 
>>> samples that show how people might use an ETL framework with Beam?
>>>
>>> Thanks in advance and have a great day!


Re: Limited join with stop condition

2019-10-11 Thread Alexey Romanenko
Many thanks for your ideas, everybody; I really appreciate it. I'm going to 
play with a stateful DoFn and see if it will work for us.

> And I have to ask, though, can you build indices instead of brute force for 
> the join?
Answering your question, Kenn: yes, potentially we could build indices for this 
case and use them for look-ups, but that would take time (since the initial 
sources are just files in S3), and the initial goal was to have a fast and 
generic solution for different sources. Also, I think we can sacrifice 
parallelism, since the amount of data to process should not be huge and the 
final output is relatively small.

At the same time, this use case and another recent KinesisIO issue got me 
thinking about an effective solution for such requests. In the end, it could be 
used, for example, for dynamic back pressure. Afaik, we don't have such an 
option in Beam, and "Read IOs" usually use an eager strategy to read as much 
data from the source as possible. Potentially, this can cause an NPE if input 
buffers are not limited in size, and it still doesn't take the downstream 
throughput into account. For instance, in pure Spark streaming jobs back 
pressure can be configured, but I doubt that will work with the SparkRunner in 
Beam, since Beam has its own implementation of IO connectors.
So, I'm wondering what your thoughts are: could this feature be useful, and 
should it be integrated into Beam?

> On 11 Oct 2019, at 06:29, Reza Rokni  wrote:
> 
> Hi,
> 
> Agreed with the others that this does not sound like a good fit... 
> 
> But to explore ideas... One possible (complicated and error-prone) way this 
> could be done ...
> 
> Beam does not support cycles, but you could use an external unbounded source 
> as a way of sending an impulse out and then back into the system to read more 
> data.
> 
> Assuming you are not using the standard Source IOs and you are reading data 
> via a DoFn (it would not work with the built-in Source IOs):
> Create a streaming pipeline that reads from an unbounded source; this source 
> is just used for signals to read more data.
> You start the initial read by sending a Start event to the unbounded source.
> In the pipeline you branch the Start event to two DoFns, DoFnReadFromSource1 
> and DoFnReadFromSource2. These will each read X records, which are then 
> wrapped in an Event object and sent forward. You will also need sequence ids 
> and an EndRead Event object (in case a source has been exhausted).
> You send the events to a stateful DoFn (in the global window) which does the 
> following:
> If the condition is not met, send a Start event back to the unbounded source 
> (which will result in more data being read).
> If the condition is met, send out the joined event and GC the data that has 
> been joined. Keep the other elements around for the next time you send a 
> Start event into the unbounded source.
> I am sure there are many corner cases I have not thought of... (for example, 
> when both sources are exhausted and you don't have a join condition match, 
> what should it do?). Also, this will result in a pipeline that is always up 
> and running.
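>
> The control loop can be simulated outside Beam to see its shape. Everything
> here is illustrative (the batch size, the join condition, the function
> names); a real pipeline would need the external unbounded source to carry
> the Start events:

```python
from itertools import islice

def impulse_join(source1, source2, batch=2, target_key=None):
    it1, it2 = iter(source1), iter(source2)
    seen1, seen2 = set(), set()            # the stateful DoFn's retained state
    while True:                            # each iteration == one Start event
        batch1 = list(islice(it1, batch))  # DoFnReadFromSource1 reads X records
        batch2 = list(islice(it2, batch))  # DoFnReadFromSource2 reads X records
        seen1.update(batch1)
        seen2.update(batch2)
        if target_key in (seen1 & seen2):  # condition met: emit and stop reading
            return target_key
        if not batch1 and not batch2:      # both sources exhausted (EndRead)
            return None                    # the no-match corner case

result = impulse_join(range(10), range(5, 15), target_key=7)
```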
> 
> Cheers
> Reza
> 
>   
> 
> 
On Fri, 11 Oct 2019 at 11:19, Kenneth Knowles  wrote:
> Interesting! I agree with Luke that it seems not a great fit for Beam in the 
> most rigorous sense. There are many considerations:
> 
> 1. We assume ParDo has side effects by default. So the model actually 
> *requires* eager evaluation, not lazy, in order to make all the side effects 
> happen. But for your case let us assume somehow we know it is all @Pure.
> 2. Lazy evaluation and parallelism are in opposition. In pure computations, 
> as in Haskell, literally everything (except monadic sequencing) is parallel 
> for free, but the problem is that nothing starts until it is needed, so 
> parallelism requires forcing computations early.
> 
> On the other hand, we can think about ways forward here. A first step is if 
> the join is a "side lookup join" where we always process all of source 1 but 
> try to process less of source 2. If source 2 is feeding into a map side input 
> then this could be lazy in some way. When an element from source 1 calls the 
> side input lookup it could be a blocking call that triggers reads from source 
> 2 until a match is found. This computation strategy is consistent with the 
> model and will read all of source 1 but only the prefix of source 2 needed to 
> join all of source 1. I think you could implement this pattern with 
> parallelism on both the main input and side input. Then, to read less of 
> source 1 you need feedback from the sink to the source. We have nothing like 
> that... This is all very abstract hypotheticals.
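>
> A tiny sketch of that side-lookup strategy (plain Python, illustrative
> names): all of source 1 is processed, and each lookup blocks to pull more
> of source 2 only until the needed key appears, so just a prefix of source 2
> is ever read.

```python
def lazy_lookup_join(source1, source2_iter):
    side = {}                            # materialized prefix of source 2
    reads = 0
    out = []
    for key, value in source1:
        while key not in side:           # blocking side-input lookup
            k2, v2 = next(source2_iter)  # triggers another read from source 2
            side[k2] = v2
            reads += 1
        out.append((key, value, side[key]))
    return out, reads

source2 = iter([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
joined, reads = lazy_lookup_join([("b", 10), ("a", 20)], source2)
```

Only two of the four source-2 records are read here; ("c", 3) and ("d", 4) are never touched.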
> 
> If we get to practical implementation "today", then every runner pretty much 
> reads all of a bounded source before even starting the next transform, no? I 
> wonder if it makes sense to convert them to unbounded (which is still allowed 
> to terminate but does 

Re: Joining PCollections to aggregates of themselves

2019-10-11 Thread Sam Stephens
On 2019/10/10 18:23:46, Eugene Kirpichov  wrote: 
> " input elements can pass through the Joiner DoFn before the sideInput
> corresponding to that element is present"
> 
> I don't think this is correct. Runners will evaluate a DoFn with side
> inputs on elements in a given window only after all side inputs are ready
> (have triggered at least once) in this window, so your code should be safe.
> However, runners will not rerun the DoFn with side inputs on subsequent
> triggerings of the side inputs, so you won't be able to update the results.

Yes, but the second or third time an element falling into a given window is 
processed by the Joiner DoFn, the side input may not be up to date with these 
new elements, so the side input having triggered at least once is not a 
guarantee that it is up to date.

On 2019/10/10 18:35:21, Robert Bradshaw  wrote: 

> Time: 00:08:00
> Input: 
Output: 

> 
> Time: 00:13:00
> Input: 

Output:  // average 4 & 6

> 
> Time: 00:00:00
> Input: 

Output:  // average 1
> 
> Time: 00:02:00
> Input: 

Output:  // average 1 & 2

I'd say the least surprising result here is that the aggregate includes the 
best available information at the time of processing. So yes, it is sensitive 
to the order of arrival; that's unavoidable, I think.

> 
> Are you really trying to emit elements with the mean of all elements
> with timestamp up to 10 minutes prior to the current value? That's a
> bit different than sliding windows. In that case you could do
> something with a Stateful DoFn that buffers elements and for each
> incoming element sets a timer at T which then reads the buffer,
> computes the output, and discards elements older than 10 minutes. You
> could also possibly do this with a custom WindowFn.
> 
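
A hedged sketch of that trailing-window suggestion, with plain Python standing in for the stateful DoFn's buffer and timer (illustrative names; events assumed here to arrive in timestamp order):

```python
from collections import deque

WINDOW_SECONDS = 10 * 60  # "elements older than 10 minutes"

def trailing_mean(events):
    """events: (timestamp_seconds, value) pairs in timestamp order."""
    buf = deque()
    for ts, value in events:
        buf.append((ts, value))
        while buf and buf[0][0] < ts - WINDOW_SECONDS:
            buf.popleft()              # discard elements older than 10 minutes
        yield ts, value, sum(v for _, v in buf) / len(buf)

out = list(trailing_mean([(0, 1), (120, 2), (780, 6)]))
```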

Yes the requirement is basically to enrich an event stream with values computed 
over arbitrary other event streams (including the event stream being enriched) 
and to do this with as low latency as possible. 

Of course the values derived from other event streams might not be included 
even if they occur before the event being enriched (even if "before" is in 
both the event-time and processing-time sense). But this is easier to swallow 
because there's no obvious causal dependency between that aggregate value and 
the event being enriched.

.. I hope that made sense


Re: Feedback on how we use Apache Beam in my company

2019-10-11 Thread Pierre Vanacker
Nice, thanks. I just registered, see you there !

Pierre

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Friday, October 11, 2019 at 15:29
To: "user@beam.apache.org" 
Cc: dev 
Subject: Re: Feedback on how we use Apache Beam in my company

Hi Pierre,

If you are in Paris region (I can guess because of Dailymotion name =) ) then 
it would be great to chat about that at next (2nd) Paris Beam meetup, which 
will be held very soon, October 17th.
https://www.meetup.com/Paris-Apache-Beam-Meetup/events/264545288/



On 11 Oct 2019, at 14:58, Pierre Vanacker  wrote:

Thanks Etienne & Matthias !

Why not, it kinda depends on the location :) What meetup / summit do you have 
in mind ?

Pierre

From: Matthias Baetens 
Reply-To: "user@beam.apache.org" 
Date: Friday, October 11, 2019 at 09:45
To: "user@beam.apache.org" 
Cc: dev 
Subject: Re: Feedback on how we use Apache Beam in my company

This is great, Pierre! Thank you for sharing, very interesting.

Would you and your team be interested to talk about your use case at a meetup 
(or Summit) in the future? :)

All the best,
Matthias

On Wed, 9 Oct 2019 at 15:59, Etienne Chauchot  wrote:
Very nice !
Thanks
ccing dev list
Etienne
On 09/10/2019 16:55, Pierre Vanacker wrote:
Hi Apache Beam community,

We’ve been working with Apache Beam in production for a few years now in my 
company (Dailymotion).

If you’re interested to know how we use Apache Beam in combination with Google 
Dataflow, we shared this experience in the following article : 
https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816

Thanks to the developers for your great work !

Regards,

Pierre



Re: Feedback on how we use Apache Beam in my company

2019-10-11 Thread Alexey Romanenko
Hi Pierre,

If you are in Paris region (I can guess because of Dailymotion name =) ) then 
it would be great to chat about that at next (2nd) Paris Beam meetup, which 
will be held very soon, October 17th.
https://www.meetup.com/Paris-Apache-Beam-Meetup/events/264545288/ 



> On 11 Oct 2019, at 14:58, Pierre Vanacker  
> wrote:
> 
> Thanks Etienne & Matthias !
>  
> Why not, it kinda depends on the location :) What meetup / summit do you have 
> in mind ?
>  
> Pierre
>  
> From: Matthias Baetens 
> Reply-To: "user@beam.apache.org" 
> Date: Friday, October 11, 2019 at 09:45
> To: "user@beam.apache.org" 
> Cc: dev 
> Subject: Re: Feedback on how we use Apache Beam in my company
>  
> This is great, Pierre! Thank you for sharing, very interesting.
>  
> Would you and your team be interested to talk about your use case at a meetup 
> (or Summit) in the future? :)
>  
> All the best,
> Matthias
>  
> On Wed, 9 Oct 2019 at 15:59, Etienne Chauchot  wrote:
> Very nice !
> 
> Thanks
> 
> ccing dev list
> 
> Etienne
> 
> On 09/10/2019 16:55, Pierre Vanacker wrote:
> Hi Apache Beam community,
>  
> We’ve been working with Apache Beam in production for a few years now in my 
> company (Dailymotion).
>  
> If you’re interested to know how we use Apache Beam in combination with 
> Google Dataflow, we shared this experience in the following article : 
> https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816
>  
> 
>  
> Thanks to the developers for your great work !
>  
> Regards,
>  
> Pierre



Re: Feedback on how we use Apache Beam in my company

2019-10-11 Thread Pierre Vanacker
Thanks Etienne & Matthias !

Why not, it kinda depends on the location :) What meetup / summit do you have 
in mind ?

Pierre

From: Matthias Baetens 
Reply-To: "user@beam.apache.org" 
Date: Friday, October 11, 2019 at 09:45
To: "user@beam.apache.org" 
Cc: dev 
Subject: Re: Feedback on how we use Apache Beam in my company

This is great, Pierre! Thank you for sharing, very interesting.

Would you and your team be interested to talk about your use case at a meetup 
(or Summit) in the future? :)

All the best,
Matthias

On Wed, 9 Oct 2019 at 15:59, Etienne Chauchot  wrote:

Very nice !

Thanks

ccing dev list

Etienne
On 09/10/2019 16:55, Pierre Vanacker wrote:
Hi Apache Beam community,

We’ve been working with Apache Beam in production for a few years now in my 
company (Dailymotion).

If you’re interested to know how we use Apache Beam in combination with Google 
Dataflow, we shared this experience in the following article : 
https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816

Thanks to the developers for your great work !

Regards,

Pierre


Re: ETL with Beam?

2019-10-11 Thread Steve973
Thank you for your reply.  I will check it out!  I'm in the evaluation
phase, especially since I have some time before I have to implement all of
this.

On Fri, Oct 11, 2019 at 3:25 AM Dan  wrote:

> I'm not sure if this will help but kettle runs on beam too.
>
> https://github.com/mattcasters/kettle-beam
>
> https://youtu.be/vgpGrQJnqkM
>
> Depends on your use case but kettle rocks for etl.
>
> Dan
>
> Sent from my phone
>
> On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:
>
>> Hello, all.  I still have not been given the tasking to convert my work
>> project to use Beam, but it is still something that I am looking to do in
>> the fairly near future.  Our data workflow consists of ingest and
>> transformation, and I was hoping that there are ETL frameworks that work
>> well with Beam.  Does anyone have some recommendations and maybe some
>> samples that show how people might use an ETL framework with Beam?
>>
>> Thanks in advance and have a great day!
>>
>


Re: Feedback on how we use Apache Beam in my company

2019-10-11 Thread Alex Van Boxel
Great writeup. You can add an additional benefit of Docker vs. templates: you
can dynamically reconfigure/rebuild your pipelines from external parameters
(e.g. arguments), instead of only using the ValueProvider placeholders.

 _/
_/ Alex Van Boxel


On Wed, Oct 9, 2019 at 4:55 PM Pierre Vanacker <
pierre.vanac...@dailymotion.com> wrote:

> Hi Apache Beam community,
>
>
>
> We’ve been working with Apache Beam in production for a few years now in
> my company (Dailymotion).
>
>
>
> If you’re interested to know how we use Apache Beam in combination with
> Google Dataflow, we shared this experience in the following article :
> https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816
>
>
>
> Thanks to the developers for your great work !
>
>
>
> Regards,
>
>
>
> Pierre
>


Re: Feedback on how we use Apache Beam in my company

2019-10-11 Thread Matthias Baetens
This is great, Pierre! Thank you for sharing, very interesting.

Would you and your team be interested to talk about your use case at a
meetup (or Summit) in the future? :)

All the best,
Matthias

On Wed, 9 Oct 2019 at 15:59, Etienne Chauchot  wrote:

> Very nice !
>
> Thanks
>
> ccing dev list
>
> Etienne
> On 09/10/2019 16:55, Pierre Vanacker wrote:
>
> Hi Apache Beam community,
>
>
>
> We’ve been working with Apache Beam in production for a few years now in
> my company (Dailymotion).
>
>
>
> If you’re interested to know how we use Apache Beam in combination with
> Google Dataflow, we shared this experience in the following article :
> https://medium.com/dailymotion/realtime-data-processing-with-apache-beam-and-google-dataflow-at-dailymotion-7d1b994dc816
>
>
>
> Thanks to the developers for your great work !
>
>
>
> Regards,
>
>
>
> Pierre
>
>


Re: ETL with Beam?

2019-10-11 Thread Dan
I'm not sure if this will help but kettle runs on beam too.

https://github.com/mattcasters/kettle-beam

https://youtu.be/vgpGrQJnqkM

Depends on your use case but kettle rocks for etl.

Dan

Sent from my phone

On Thu, 10 Oct 2019, 10:12 pm Steve973,  wrote:

> Hello, all.  I still have not been given the tasking to convert my work
> project to use Beam, but it is still something that I am looking to do in
> the fairly near future.  Our data workflow consists of ingest and
> transformation, and I was hoping that there are ETL frameworks that work
> well with Beam.  Does anyone have some recommendations and maybe some
> samples that show how people might use an ETL framework with Beam?
>
> Thanks in advance and have a great day!
>