Consultant needed in Los Angeles - Google Dataflow / Apache Beam model

2017-12-01 Thread mjo...@techlink.org
Hi, I am looking for two Data Engineers with Google Cloud experience who are 
open to contract work (onsite) in Los Angeles at a large media company.

The job requires professional experience designing, building, and maintaining data 
systems and processes on cloud-based platforms (Google BigQuery and Cloud 
Dataflow extremely desirable; Airflow is a big plus), including experience 
working with Unix/Linux operating systems and tools. Also required:

- Expertise using cloud-based systems and services to acquire and deliver data 
  via APIs and flat files
- Hands-on professional software development skills in Java
- Extensive hands-on experience working with data using SQL

Please contact me if there is any interest. This is a great opportunity.

Mike
310-566-7155



Re: Callbacks/other functions run after a PDone/output transform

2017-12-01 Thread Chet Aldrich
So I agree generally with the idea that returning a PCollection makes all of 
this easier, since arbitrary additional functions can then be added. But what 
exactly would write transforms return in a PCollection that would make sense? The 
whole idea is that we’ve written to an external source and now the collection 
itself is no longer needed. 

Currently that’s represented with a PDone, which doesn’t allow any work to 
occur after it. I see a couple of possible ways of handling this given 
this conversation, and am curious which solution sounds like the best way to 
deal with the problem:

1. Have output transforms always return something specific (which would be the 
same across transforms by convention), that is in the form of a PCollection, so 
operations can occur after it. 

2. Make either PDone or some new type that can act as a PCollection so we can 
run applies afterward. 

3. Make output transforms provide the facility for a callback function which 
runs after the transform is complete.
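For concreteness, option (1) could be modeled roughly like this in plain Java. This is a toy sketch, not the real Beam API; the class and method names are made up. The point is simply that a write step returning a collection of per-element results, instead of a terminal "done" marker, lets further steps be chained onto it:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Toy model of option (1): the "write" step returns a collection of
// results rather than a PDone-style terminal type, so callers can keep
// chaining operations after the write. All names here are hypothetical.
class WriteThenContinue {

    // Stand-in for a sink write: "writes" each element and returns one
    // result per element (here, just the element's length).
    static List<Integer> writeAll(List<String> docs) {
        return docs.stream().map(String::length).collect(Collectors.toList());
    }

    // Because writeAll returns a collection rather than a terminal type,
    // a follow-up step (e.g. triggering an index swap) can consume it.
    static <T, R> R then(List<T> writeResults, Function<List<T>, R> followUp) {
        return followUp.apply(writeResults);
    }

    public static void main(String[] args) {
        List<Integer> results = writeAll(List.of("a", "bb", "ccc"));
        int totalWritten = then(results, List::size);
        System.out.println("wrote " + totalWritten + " docs");
    }
}
```

In real Beam terms this would mean the write transform's output type carries the per-element (or per-bundle) write results, which is essentially what option (1) proposes.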

I went through these gymnastics recently when I was trying to build something 
that would move indices after writing to Algolia, and my solution was to 
co-opt code from the old Sink class that used to exist in Beam. The problem is 
that this particular method requires the output transform in question
to return a PCollection, even when it is trivial or doesn’t make sense to return 
one. This seems like a bad solution, but unfortunately there isn’t a notion of 
a transform that has no explicit output yet still allows operations to run 
after it. 

The three potential solutions above address this issue, but I would like to 
hear which would be preferable (or perhaps a different proposal 
altogether?). Perhaps we could also open a ticket on this, since it seems 
like a worthwhile feature addition. I would find it really useful, for one. 

Chet

> On Dec 1, 2017, at 12:19 PM, Lukasz Cwik wrote:
> 
> Instead of a callback fn, it's most useful if a PCollection is returned 
> containing the result of the sink so that any arbitrary additional functions 
> can be applied.
Re:

2017-12-01 Thread Lukasz Cwik
Instead of a callback fn, it's most useful if a PCollection is returned
containing the result of the sink so that any arbitrary additional
functions can be applied.

On Fri, Dec 1, 2017 at 7:14 AM, Jean-Baptiste Onofré wrote:

> Agree, I would prefer to do the callback in the IO rather than in the main method.
>
> Regards
> JB

Re:

2017-12-01 Thread Jean-Baptiste Onofré

Agree, I would prefer to do the callback in the IO rather than in the main method.

Regards
JB

On 12/01/2017 03:54 PM, Steve Niemitz wrote:
I do something almost exactly like this, but with BigtableIO instead.  I have a 
pull request open here [1] (which reminds me I need to finish this up...).  It 
would really be nice for most IOs to support something like this.


Essentially you do a GroupByKey (or some CombineFn) on the output from the 
BigtableIO, and then feed that into your function which will run when all writes 
finish.


You probably want to avoid doing something in the main method because there's no 
guarantee it'll actually run (maybe the driver will die, get killed, machine 
will explode, etc).


[1] https://github.com/apache/beam/pull/3997


Re: Regarding Beam Slack Channel

2017-12-01 Thread Wesley Tanaka

both sent


On 12/01/2017 06:43 AM, Ziyad Muhammed wrote:

me too, thanks in advance

Best
Ziyad

--
Wesley Tanaka
https://wtanaka.com/



Re:

2017-12-01 Thread Steve Niemitz
I do something almost exactly like this, but with BigtableIO instead.  I
have a pull request open here [1] (which reminds me I need to finish this
up...).  It would really be nice for most IOs to support something like
this.

Essentially you do a GroupByKey (or some CombineFn) on the output from the
BigtableIO, and then feed that into your function which will run when all
writes finish.

You probably want to avoid doing something in the main method because
there's no guarantee it'll actually run (maybe the driver will die, get
killed, machine will explode, etc).

[1] https://github.com/apache/beam/pull/3997
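Stripped of the Beam and BigtableIO specifics, the shape of that pattern is: funnel every write's result under one key, and only once that grouped collection exists, run the completion hook. A toy sketch in plain Java (the class and method names are made up for illustration, not the real API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.stream.Collectors;

// Toy sketch of "GroupByKey on the sink's output, then run a hook":
// all per-write results are grouped under a single constant key, and
// the hook fires once per group -- i.e. once, after "all writes".
// All names here are hypothetical.
class AfterAllWrites {

    // Group every write result under one constant key, mimicking a
    // GroupByKey that funnels all results into a single group.
    static Map<String, List<String>> groupAll(List<String> writeResults) {
        return writeResults.stream()
                .collect(Collectors.groupingBy(r -> "all"));
    }

    // Run the hook once per group, with the complete set of results.
    static void onComplete(Map<String, List<String>> grouped,
                           Consumer<List<String>> hook) {
        grouped.values().forEach(hook);
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = groupAll(List.of("ok-1", "ok-2", "ok-3"));
        onComplete(g, results ->
                System.out.println("all " + results.size() + " writes finished"));
    }
}
```

In an actual pipeline the grouping is what gives you the "runs after all writes" guarantee, since the grouped element can only be produced once the upstream write results are complete for the window.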


Re:

2017-12-01 Thread NerdyNick
Assuming you're in Java, you could just follow on in your main method,
checking the state of the result.

Example:

PipelineResult result = pipeline.run();
try {
  result.waitUntilFinish();
  if (result.getState() == PipelineResult.State.DONE) {
    // do the ES work here
  }
} catch (Exception e) {
  result.cancel();
  throw e;
}

Otherwise you could also use Oozie to construct a work flow.




-- 
Nick Verbeck - NerdyNick

NerdyNick.com
TrailsOffroad.com
NoKnownBoundaries.com


Re: Regarding Beam Slack Channel

2017-12-01 Thread Ziyad Muhammed
me too, thanks in advance

Best
Ziyad



RE: Regarding Beam Slack Channel

2017-12-01 Thread linrick
Hi
Can I receive this invitation, too?

Thanks
Rick

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Friday, December 01, 2017 12:53 PM
To: user@beam.apache.org
Subject: Re: Regarding Beam Slack Channel

Invite sent as well.

Regards
JB

On 11/30/2017 07:19 PM, Yanael Barbier wrote:
> Hello
> Can I get an invite too?
>
> Thanks,
> Yanael
>
> On Thu, Nov 30, 2017 at 19:15, Wesley Tanaka wrote:
>
> Invite sent
>
>
> On 11/30/2017 08:11 AM, Nalseez Duke wrote:
>> Hello
>>
>> Can someone please add me to the Beam slack channel?
>>
>> Thanks.
>
>
> --
> Wesley Tanaka
> https://wtanaka.com/
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


--
本信件可能包含工研院機密資訊,非指定之收件者,請勿使用或揭露本信件內容,並請銷毀此信件。 This email may contain 
confidential information. Please do not use or disclose it in any way and 
delete it if you are not the intended recipient.


Re:

2017-12-01 Thread Jean-Baptiste Onofré

Hi,

yes, we had a similar question some days ago.

We can imagine having a user callback fn fired when the sink batch is complete.

Let me think about that.

Regards
JB



Re:

2017-12-01 Thread Philip Chan
Hey JB,

Thanks for getting back so quickly.
I suppose in that case I would need a way of monitoring when the ES
transform completes successfully before I can proceed with doing the swap.
The problem with this is that I can't think of a good way to determine that
termination state short of polling the new index to check the document
count compared to the size of input PCollection.
That, or maybe I'd need to use an external system like you mentioned to
poll on the state of the pipeline (I'm using Google Dataflow, so maybe
there's a way to do this with some API).
But I would have thought that there would be an easy way of simply saying
"do not process this transform until this other transform completes".
Is there no established way of "signaling" between pipelines when some
pipeline completes, or some way of declaring a dependency of one
transform on another?

Thanks again,
Philip

On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste Onofré wrote:

> Hi Philip,
>
> You won't be able to do (3) in the same pipeline as the Elasticsearch Sink
> PTransform ends the pipeline with PDone.
>
> So, (3) has to be done in another pipeline (using a DoFn) or in another
> "system" (like Camel for instance). I would do a check of the data in the
> index and then trigger the swap there.
>
> Regards
> JB
>
> On 12/01/2017 08:41 AM, Philip Chan wrote:
>
>> Hi,
>>
>> I'm pretty new to Beam, and I've been trying to use the ElasticSearchIO
>> sink to write docs into ES.
>> With this, I want to be able to
>> 1. ingest and transform rows from DB (done)
>> 2. write JSON docs/strings into a new ES index (done)
>> 3. After (2) is complete and all documents are written into a new index,
>> trigger an atomic index swap under an alias to replace the current aliased
>> index with the new index generated in step 2. This is basically a single
>> POST request to the ES cluster.
>>
>> The problem I'm facing is that I don't seem to be able to find a way to
>> have a way for (3) to happen after step (2) is complete.
>>
>> The ElasticSearchIO.Write transform returns a PDone, and I'm not sure how
>> to proceed from there because it doesn't seem to let me do another apply on
>> it to "define" a dependency.
>> https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html
>>
>> Is there a recommended way to construct pipelines workflows like this?
>>
>> Thanks in advance,
>> Philip
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
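For reference, the atomic swap in step (3) really is a single POST to the Elasticsearch `_aliases` endpoint with paired remove/add actions, which Elasticsearch applies atomically. A minimal sketch of building that request body (the index and alias names are examples only):

```java
// Build the request body for an atomic Elasticsearch alias swap:
// POSTing this JSON to /_aliases removes the alias from the old index
// and adds it to the new one in a single atomic operation.
// Index/alias names below are examples only.
class AliasSwap {

    static String aliasSwapBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
                + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
                + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
                + "]}";
    }

    public static void main(String[] args) {
        // Send with any HTTP client, e.g. POST http://es-host:9200/_aliases
        System.out.println(aliasSwapBody("docs", "docs-v1", "docs-v2"));
    }
}
```

Once the pipeline's completion is observed (for example via PipelineResult.waitUntilFinish as suggested in this thread), this is the only request needed to cut readers over to the new index.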