Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
+dev 

On Fri, Jun 28, 2019 at 8:20 AM Chad Dombrova  wrote:

>
> I think the simplest solution would be to have some kind of override/hook
>> that allows Flink/Spark/... to provide storage. They already have a concept
>> of a job and know how to store them so can we piggyback the Beam pipeline
>> there.
>>
>
> That makes sense to me, since it avoids adding a dependency on a database
> like Mongo, which adds complexity to the deployment.  That said, Beam's
> definition of a job is different from Flink/Spark/etc.  To support this, a
> runner would need to support storing arbitrary metadata, so that the Beam
> Job Service could store a copy of each Beam job there (pipeline, pipeline
> options, etc), either directly as serialized protobuf messages, or by
> converting those to json.  Do you know offhand if Flink and Spark support
> that kind of arbitrary storage?
>
> -chad
>
>
>


Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Chad Dombrova
> I think the simplest solution would be to have some kind of override/hook
> that allows Flink/Spark/... to provide storage. They already have a concept
> of a job and know how to store them so can we piggyback the Beam pipeline
> there.
>

That makes sense to me, since it avoids adding a dependency on a database
like Mongo, which adds complexity to the deployment.  That said, Beam's
definition of a job is different from Flink/Spark/etc.  To support this, a
runner would need to support storing arbitrary metadata, so that the Beam
Job Service could store a copy of each Beam job there (pipeline, pipeline
options, etc), either directly as serialized protobuf messages, or by
converting those to json.  Do you know offhand if Flink and Spark support
that kind of arbitrary storage?

-chad


Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
I think the simplest solution would be to have some kind of override/hook
that allows Flink/Spark/... to provide storage. They already have a concept
of a job and know how to store them so can we piggyback the Beam pipeline
there.

On Fri, Jun 28, 2019 at 7:49 AM Chad Dombrova  wrote:

>
> In reality, a more complex job service is needed that is backed by some
>> kind of persistent storage or stateful service.
>>
>
> I was afraid you were going to say that :)  So is anything like this
> planned or in the works?
>
> I see that there's also ReferenceRunnerJobService, but it seems like a
> very similar implementation to InMemoryJobService.  What's the use case for
> that?
>
> -chad
>
>


Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Chad Dombrova
> In reality, a more complex job service is needed that is backed by some
> kind of persistent storage or stateful service.
>

I was afraid you were going to say that :)  So is anything like this
planned or in the works?

I see that there's also ReferenceRunnerJobService, but it seems like a very
similar implementation to InMemoryJobService.  What's the use case for that?

-chad


Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
The InMemoryJobService is meant to be a simple implementation. Adding a job
expiration N minutes after the job completes might make sense.

In reality, a more complex job service is needed that is backed by some
kind of persistent storage or stateful service.

On Thu, Jun 27, 2019 at 10:45 PM Chad Dombrova  wrote:

> Hi all,
> Thanks for all the support!
>
> I put together a rough working version of this already and it was quite
> easy, even for a Java newb.
>
> After playing with it a little I was surprised to find that:
>
> A) completed jobs are not cleared from the job service
> B) job info is not persisted between restarts of the job service
>
> It seems like this adds up to a memory leak which can only be resolved by
> restarting the service and thereby losing information about jobs which may
> be actively running.  How are people dealing with this currently?
>
> Note, I'm referring to the InMemoryJobService that's started like this:
>
> ./gradlew :beam-runners-flink-1.8-job-server:runShadow
> -PflinkMasterUrl=localhost:8081
>
> Thanks for the tip on the .dot graph exporter!  That will come in handy.
>
> -chad
>
>
> On Wed, Jun 26, 2019 at 6:39 AM Tim Robertson 
> wrote:
>
>> Another +1 to support your research into this Chad. Thank you.
>>
>> Trying to understand where a beam process is in the Spark DAG is... not
>> easy. A UI that helped would be a great addition.
>>
>>
>>
>> On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía  wrote:
>>
>>> +1 don't hesitate to create a JIRA + PR. You may be interested in [1].
>>> This is a simple util class that takes a proto pipeline object and
>>> converts it into its graph representation in .dot format. You can
>>> easily reuse the code or the idea as a first approach to show what the
>>> pipeline is about.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/2df702a1448fa6cbd22cd225bf16e9ffc4c82595/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PortablePipelineDotRenderer.java#L29
>>>
>>> On Wed, Jun 26, 2019 at 10:27 AM Robert Bradshaw 
>>> wrote:
>>> >
>>> > Yes, offering a way to get a pipeline from the job service directly
>>> > would be a completely reasonable thing to do (and likely not hard at
>>> > all). We welcome pull requests.
>>> >
>>> > Alternative UIs built on top of this abstraction would be an
>>> > interesting project to explore.
>>> >
>>> > On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova 
>>> wrote:
>>> > >
>>> > > Hi all,
>>> > > I've been poking around the beam source code trying to determine
>>> whether it's possible to get the definition of a pipeline via beam's
>>> gPRC-based services.   It looks like the message types are there for
>>> describing a Pipeline but as far as I can tell, they're only used by
>>> JobService.Prepare() for submitting a new job.
>>> > >
>>> > > If I were to create a PR to add support for a
>>> JobService.GetPipeline() method, would that be interesting to others?  Is
>>> it technically feasible?  i.e. is the pipeline definition readily available
>>> to the job service after the job has been prepared and sent to the runner?
>>> > >
>>> > > Bigger picture, what I'm thinking about is writing a UI that's
>>> designed to view and monitor Beam pipelines via the portability
>>> abstraction, rather than using the (rather clunky) UIs that come with
>>> runners like Flink and Dataflow.  My thinking is that using beam's
>>> abstractions would future proof the UI by allowing it to work with any
>>> portable runner.  Right now it's just an idea, so I'd love to know what
>>> others think of this.
>>> > >
>>> > > thanks!
>>> > > -chad
>>> > >
>>>
>>


Re: gRPC method to get a pipeline definition?

2019-06-27 Thread Chad Dombrova
Hi all,
Thanks for all the support!

I put together a rough working version of this already and it was quite
easy, even for a Java newb.

After playing with it a little I was surprised to find that:

A) completed jobs are not cleared from the job service
B) job info is not persisted between restarts of the job service

It seems like this adds up to a memory leak which can only be resolved by
restarting the service and thereby losing information about jobs which may
be actively running.  How are people dealing with this currently?

Note, I'm referring to the InMemoryJobService that's started like this:

./gradlew :beam-runners-flink-1.8-job-server:runShadow
-PflinkMasterUrl=localhost:8081

Thanks for the tip on the .dot graph exporter!  That will come in handy.

-chad


On Wed, Jun 26, 2019 at 6:39 AM Tim Robertson 
wrote:

> Another +1 to support your research into this Chad. Thank you.
>
> Trying to understand where a beam process is in the Spark DAG is... not
> easy. A UI that helped would be a great addition.
>
>
>
> On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía  wrote:
>
>> +1 don't hesitate to create a JIRA + PR. You may be interested in [1].
>> This is a simple util class that takes a proto pipeline object and
>> converts it into its graph representation in .dot format. You can
>> easily reuse the code or the idea as a first approach to show what the
>> pipeline is about.
>>
>> [1]
>> https://github.com/apache/beam/blob/2df702a1448fa6cbd22cd225bf16e9ffc4c82595/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PortablePipelineDotRenderer.java#L29
>>
>> On Wed, Jun 26, 2019 at 10:27 AM Robert Bradshaw 
>> wrote:
>> >
>> > Yes, offering a way to get a pipeline from the job service directly
>> > would be a completely reasonable thing to do (and likely not hard at
>> > all). We welcome pull requests.
>> >
>> > Alternative UIs built on top of this abstraction would be an
>> > interesting project to explore.
>> >
>> > On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova 
>> wrote:
>> > >
>> > > Hi all,
>> > > I've been poking around the beam source code trying to determine
>> whether it's possible to get the definition of a pipeline via beam's
>> gPRC-based services.   It looks like the message types are there for
>> describing a Pipeline but as far as I can tell, they're only used by
>> JobService.Prepare() for submitting a new job.
>> > >
>> > > If I were to create a PR to add support for a
>> JobService.GetPipeline() method, would that be interesting to others?  Is
>> it technically feasible?  i.e. is the pipeline definition readily available
>> to the job service after the job has been prepared and sent to the runner?
>> > >
>> > > Bigger picture, what I'm thinking about is writing a UI that's
>> designed to view and monitor Beam pipelines via the portability
>> abstraction, rather than using the (rather clunky) UIs that come with
>> runners like Flink and Dataflow.  My thinking is that using beam's
>> abstractions would future proof the UI by allowing it to work with any
>> portable runner.  Right now it's just an idea, so I'd love to know what
>> others think of this.
>> > >
>> > > thanks!
>> > > -chad
>> > >
>>
>


Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Tim Robertson
Another +1 to support your research into this Chad. Thank you.

Trying to understand where a beam process is in the Spark DAG is... not
easy. A UI that helped would be a great addition.



On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía  wrote:

> +1 don't hesitate to create a JIRA + PR. You may be interested in [1].
> This is a simple util class that takes a proto pipeline object and
> converts it into its graph representation in .dot format. You can
> easily reuse the code or the idea as a first approach to show what the
> pipeline is about.
>
> [1]
> https://github.com/apache/beam/blob/2df702a1448fa6cbd22cd225bf16e9ffc4c82595/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PortablePipelineDotRenderer.java#L29
>
> On Wed, Jun 26, 2019 at 10:27 AM Robert Bradshaw 
> wrote:
> >
> > Yes, offering a way to get a pipeline from the job service directly
> > would be a completely reasonable thing to do (and likely not hard at
> > all). We welcome pull requests.
> >
> > Alternative UIs built on top of this abstraction would be an
> > interesting project to explore.
> >
> > On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova  wrote:
> > >
> > > Hi all,
> > > I've been poking around the beam source code trying to determine
> whether it's possible to get the definition of a pipeline via beam's
> gPRC-based services.   It looks like the message types are there for
> describing a Pipeline but as far as I can tell, they're only used by
> JobService.Prepare() for submitting a new job.
> > >
> > > If I were to create a PR to add support for a JobService.GetPipeline()
> method, would that be interesting to others?  Is it technically feasible?
> i.e. is the pipeline definition readily available to the job service after
> the job has been prepared and sent to the runner?
> > >
> > > Bigger picture, what I'm thinking about is writing a UI that's
> designed to view and monitor Beam pipelines via the portability
> abstraction, rather than using the (rather clunky) UIs that come with
> runners like Flink and Dataflow.  My thinking is that using beam's
> abstractions would future proof the UI by allowing it to work with any
> portable runner.  Right now it's just an idea, so I'd love to know what
> others think of this.
> > >
> > > thanks!
> > > -chad
> > >
>


Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Ismaël Mejía
+1 don't hesitate to create a JIRA + PR. You may be interested in [1].
This is a simple util class that takes a proto pipeline object and
converts it into its graph representation in .dot format. You can
easily reuse the code or the idea as a first approach to show what the
pipeline is about.

[1] 
https://github.com/apache/beam/blob/2df702a1448fa6cbd22cd225bf16e9ffc4c82595/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PortablePipelineDotRenderer.java#L29

On Wed, Jun 26, 2019 at 10:27 AM Robert Bradshaw  wrote:
>
> Yes, offering a way to get a pipeline from the job service directly
> would be a completely reasonable thing to do (and likely not hard at
> all). We welcome pull requests.
>
> Alternative UIs built on top of this abstraction would be an
> interesting project to explore.
>
> On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova  wrote:
> >
> > Hi all,
> > I've been poking around the beam source code trying to determine whether 
> > it's possible to get the definition of a pipeline via beam's gPRC-based 
> > services.   It looks like the message types are there for describing a 
> > Pipeline but as far as I can tell, they're only used by 
> > JobService.Prepare() for submitting a new job.
> >
> > If I were to create a PR to add support for a JobService.GetPipeline() 
> > method, would that be interesting to others?  Is it technically feasible?  
> > i.e. is the pipeline definition readily available to the job service after 
> > the job has been prepared and sent to the runner?
> >
> > Bigger picture, what I'm thinking about is writing a UI that's designed to 
> > view and monitor Beam pipelines via the portability abstraction, rather 
> > than using the (rather clunky) UIs that come with runners like Flink and 
> > Dataflow.  My thinking is that using beam's abstractions would future proof 
> > the UI by allowing it to work with any portable runner.  Right now it's 
> > just an idea, so I'd love to know what others think of this.
> >
> > thanks!
> > -chad
> >


Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Robert Bradshaw
Yes, offering a way to get a pipeline from the job service directly
would be a completely reasonable thing to do (and likely not hard at
all). We welcome pull requests.

Alternative UIs built on top of this abstraction would be an
interesting project to explore.

On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova  wrote:
>
> Hi all,
> I've been poking around the beam source code trying to determine whether it's 
> possible to get the definition of a pipeline via beam's gPRC-based services.  
>  It looks like the message types are there for describing a Pipeline but as 
> far as I can tell, they're only used by JobService.Prepare() for submitting a 
> new job.
>
> If I were to create a PR to add support for a JobService.GetPipeline() 
> method, would that be interesting to others?  Is it technically feasible?  
> i.e. is the pipeline definition readily available to the job service after 
> the job has been prepared and sent to the runner?
>
> Bigger picture, what I'm thinking about is writing a UI that's designed to 
> view and monitor Beam pipelines via the portability abstraction, rather than 
> using the (rather clunky) UIs that come with runners like Flink and Dataflow. 
>  My thinking is that using beam's abstractions would future proof the UI by 
> allowing it to work with any portable runner.  Right now it's just an idea, 
> so I'd love to know what others think of this.
>
> thanks!
> -chad
>


gRPC method to get a pipeline definition?

2019-06-26 Thread Chad Dombrova
Hi all,
I've been poking around the beam source code trying to determine whether
it's possible to get the definition of a pipeline via beam's gPRC-based
services.   It looks like the message types are there for describing a
Pipeline

but
as far as I can tell, they're only used by JobService.Prepare() for
submitting a new job.

If I were to create a PR to add support for a JobService.GetPipeline()
method, would that be interesting to others?  Is it technically feasible?
i.e. is the pipeline definition readily available to the job service after
the job has been prepared and sent to the runner?

Bigger picture, what I'm thinking about is writing a UI that's designed to
view and monitor Beam pipelines via the portability abstraction, rather
than using the (rather clunky) UIs that come with runners like Flink and
Dataflow.  My thinking is that using beam's abstractions would future proof
the UI by allowing it to work with any portable runner.  Right now it's
just an idea, so I'd love to know what others think of this.

thanks!
-chad