Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-04-30 Thread lu jander
Hi Trevor

You mean adding an abstract layer for basic functions like run job, cancel
job, etc., and then implementing these in each specific job engine, such as
Oozie or the Ooyala job server, right?

So I will do two things:
(1) Work on scheduled EDP jobs using Oozie coordination, to make sure we
can run scheduled EDP jobs (currently in progress; I will submit a patch
later).
(2) Abstract the basic functions into an abstract layer, and rewrite the
specific job engines such as Oozie and the Ooyala job server on top of it
(after finishing the first step, I will discuss this with you).

Do you think I have understood your meaning, or have I missed something?

2015-04-22 21:59 GMT+08:00 Trevor McKay tmc...@redhat.com:

 On Wed, 2015-04-22 at 12:36 +, Chen, Ken wrote:
  o more complex workflows (job dependencies, DAGs, etc.). Do we rely on
 Oozie, or something else?
   Huichun is now looking into this. I am not sure whether you already
 have some detailed ideas about this. If needed, we can contribute some effort.
 If no details are ready, we can help draft a first version.

 I just made a note on the pad

 https://etherpad.openstack.org/p/sahara-liberty-proposed-sessions

 Maybe the right approach here is to develop a mapping notation that can
 be expressed as a JSON object (like the proposed job interface mapping).

 If we can develop an abstract way to describe relationships between
 jobs, then the individual EDP engines can implement it. For the Oozie
 EDP engine, maybe it uses Oozie features in workflows.  For Spark, or
 Storm, maybe it uses some existing opensource coordinator or one is
 written.

 The key idea would be to make job coordination part of the EDP engine,
 with a well defined set of objects to describe the relationships.

 What do you think? Just a rough idea.  Maybe there is a better way.



 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-04-22 Thread Trevor McKay
Hi Ken,

  responses inline

On Wed, 2015-04-22 at 12:36 +, Chen, Ken wrote:
 Hi Trevor, 
 I saw the items below in the Proposed Sprint Topics for Sahara Liberty:
 https://etherpad.openstack.org/p/sahara-liberty-proposed-sessions. I
 guess these are the EDP ideas we want to discuss at the Vancouver design
 summit. We have some comments below: 

Yes, feel free to add anything else to the pad.  We'll talk about as
much as we have time for.  I'm thinking that most of it will be covered
on Friday, or in between sessions during the week if folks are around.

 o   job scheduler (proposed by weiting) 
  we already have a spec on this, please help review it and give
 your comments and ideas. https://review.openstack.org/#/c/175719/  

Great! thanks

 o   more complex workflows (job dependencies, DAGs, etc.). Do we
 rely on Oozie, or something else? 
  Huichun is now looking into this. I am not sure whether you already
 have some detailed ideas about this. If needed, we can contribute some
 effort. If no details are ready, we can help draft a first
 version. 

No work on this so far, although we have talked about it off and on for
a few cycles. Oozie has a lot of capabilities for coordination, but we
are not Oozie-only, so what do we do?  This is the central question.

 o   job interface mapping
 https://blueprints.launchpad.net/sahara/+spec/unified-job-interface-map 
 proposed in Kilo but moved to Liberty 
    ++ high priority in my opinion.  Should be done early, awesome
 feature 
  This seems interesting. We agree the EDP UI should be improved. In fact,
 our team has some unsettled thinking about EDP. Some people do
 not like the current EDP design and think it is more like a redesign
 of the Oozie or Spark UI than a universal interface for users.
 However, we do not have a clear strategy on this part yet. 

Yes, Oozie had a heavy influence on EDP. This is partly historical --
EDP was written rapidly between Havana and Icehouse and based on Oozie
since it offered handling of jobs and multiple types out of the box. It
was a quick path to EDP functionality.

However, EDP should be more of a universal interface. We only support a
few conceptual operations -- run job, cancel job, and job status. With
those three operations, we should be able to run anything. For example,
recently Telles has been working on Storm support.
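
As a rough illustration of the three conceptual operations, here is a
minimal Python sketch of an engine interface. The names JobEngine and
InMemoryEngine, the method names, and the status strings are hypothetical
and chosen only for this example; they are not taken from the Sahara code
base.

```python
import abc


class JobEngine(abc.ABC):
    """Hypothetical minimal EDP engine interface: run, cancel, status."""

    @abc.abstractmethod
    def run_job(self, job_id):
        """Start the job and return an engine-specific identifier."""

    @abc.abstractmethod
    def cancel_job(self, job_id):
        """Ask the engine to stop a running job."""

    @abc.abstractmethod
    def get_job_status(self, job_id):
        """Return the current status of the job as a string."""


class InMemoryEngine(JobEngine):
    """Toy engine that only tracks state; a stand-in for a real backend."""

    def __init__(self):
        self._status = {}

    def run_job(self, job_id):
        self._status[job_id] = "RUNNING"
        return job_id

    def cancel_job(self, job_id):
        # Only a running job can be cancelled in this toy model.
        if self._status.get(job_id) == "RUNNING":
            self._status[job_id] = "KILLED"

    def get_job_status(self, job_id):
        return self._status.get(job_id, "UNKNOWN")
```

An Oozie, Spark, or Storm engine would then be one more subclass, and
callers would only depend on the three methods.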

The job interface mapping will help generalize how arguments are passed
to jobs and allow us to remove some assumptions about jobs.  I am all
for other generalizations that will move EDP further in the direction of
a general interface.

 o   early error detection to help transient clusters -- how many
 things can we detect early that can go wrong with an EDP job so that
 we return an error before spinning up the cluster (only to find that
 the job fails once the cluster is launched?) Ex, bad swift paths 
  seems easier, but may include some trivial work. 

Some of this will be folded into the job interface mapping.  Ethan has
just updated the spec to include input_datasource and
output_datasource as argument types. If we know what will be done with
a datasource, we can potentially validate it before the job runs.
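
A sketch of what such an early check could look like for a Swift path,
before any cluster is started. The function name check_swift_path and the
assumed URL shape swift://container/object are illustrative assumptions,
not the actual Sahara validation rules.

```python
from urllib.parse import urlparse


def check_swift_path(url):
    """Return a list of problems found in a Swift URL; empty means it looks fine.

    Assumes the shape swift://<container>/<object path> for illustration.
    """
    problems = []
    parts = urlparse(url)
    if parts.scheme != "swift":
        problems.append("scheme must be 'swift'")
    if not parts.netloc:
        problems.append("missing container name")
    if not parts.path.strip("/"):
        problems.append("missing object path")
    return problems
```

This only catches malformed paths; checking that the container and object
actually exist would need a call to Swift itself.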

 •   Spark plugins -- we have an independent Spark plugin, but we
 also have Spark supported by mapr, and in the future it will be
 supported by Ambari.  Should we continue to carry a simple Spark
 standalone plugin?  Or should we work toward shifting our Spark
 support to one or more vendor plugins? 
 Not sure what this will impact. 

Thought about this more. Overlap in the plugins is fine, as long as
there is someone in the community willing and able to support it. 

 -Ken 





Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-04-22 Thread Chen, Ken
Hi Trevor,
I saw the items below in the Proposed Sprint Topics for Sahara Liberty:
https://etherpad.openstack.org/p/sahara-liberty-proposed-sessions. I guess 
these are the EDP ideas we want to discuss at the Vancouver design summit. We have 
some comments below:

•   EDP Priorities in Liberty - At the last 2 summits, we've looked at 
possible work for EDP in the cycle and prioritized it. This is helpful, since 
there is probably more that could be done than can be done in a single cycle :)
o   job scheduler (proposed by weiting)
 we already have a spec on this, please help review it and give your 
comments and ideas. https://review.openstack.org/#/c/175719/ 

o   more complex workflows (job dependencies, DAGs, etc.). Do we rely on 
Oozie, or something else?
 Huichun is now looking into this. I am not sure whether you already have 
some detailed ideas about this. If needed, we can contribute some effort. If no 
details are ready, we can help draft a first version.

o   job interface mapping 
https://blueprints.launchpad.net/sahara/+spec/unified-job-interface-map 
proposed in Kilo but moved to Liberty
   ++ high priority in my opinion.  Should be done early, awesome feature
 This seems interesting. We agree the EDP UI should be improved. In fact, our 
team has some unsettled thinking about EDP. Some people do not like the current 
EDP design and think it is more like a redesign of the Oozie or Spark UI 
than a universal interface for users. However, we do not have a clear 
strategy on this part yet.

o   early error detection to help transient clusters -- how many things can 
we detect early that can go wrong with an EDP job so that we return an error 
before spinning up the cluster (only to find that the job fails once the 
cluster is launched?) Ex, bad swift paths
 seems easier, but may include some trivial work.

•   Spark plugins -- we have an independent Spark plugin, but we also have 
Spark supported by mapr, and in the future it will be supported by Ambari.  
Should we continue to carry a simple Spark standalone plugin?  Or should we 
work toward shifting our Spark support to one or more vendor plugins?
Not sure what this will impact.

-Ken


-Original Message-
From: Trevor McKay [mailto:tmc...@redhat.com] 
Sent: Tuesday, March 24, 2015 10:49 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

Weiting, Andrew,

Agreed, great ideas!  As Andrew noted, we have discussed some of these things 
before and it would be great to discuss them in Vancouver.

I think that a Sahara-side workflow manager is the right approach. Oozie has a 
lot of capability for job coordination, but it won't work for all of our 
cluster and job types.

Notes on Spark in particular -- when we implemented Spark EDP, we looked at 
various implementations for a Spark job server.  One was to extend Oozie, one 
was to use the Ooyala Spark job server, and one was to use ssh around 
spark-submit.  We chose the last, notes are here:

https://etherpad.openstack.org/p/sahara_spark_edp

We could potentially revisit the Ooyala job server.  My impression at the time 
was that for the functions we wanted, it was pretty heavy. But if we are going 
to add job coordination as a general feature, it may be appropriate. I believe 
in the Spark community it is the dominant solution for job management, open 
source is here:

https://github.com/spark-jobserver/spark-jobserver

As part of the Spark investigation, I posted on this JIRA, too. This is a JIRA 
for developing a REST api to the spark job server, which may be enough for us 
to build our own coordination system:

https://issues.apache.org/jira/browse/SPARK-3644

Best,

Trevor


Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-04-22 Thread Trevor McKay
On Wed, 2015-04-22 at 12:36 +, Chen, Ken wrote:
 o more complex workflows (job dependencies, DAGs, etc.). Do we rely on 
 Oozie, or something else?
  Huichun is now looking into this. I am not sure whether you already have 
 some detailed ideas about this. If needed, we can contribute some effort. If no 
 details are ready, we can help draft a first version.

I just made a note on the pad 

https://etherpad.openstack.org/p/sahara-liberty-proposed-sessions

Maybe the right approach here is to develop a mapping notation that can
be expressed as a JSON object (like the proposed job interface mapping).

If we can develop an abstract way to describe relationships between
jobs, then the individual EDP engines can implement it. For the Oozie
EDP engine, maybe it uses Oozie features in workflows.  For Spark, or
Storm, maybe it uses some existing opensource coordinator or one is
written.

The key idea would be to make job coordination part of the EDP engine,
with a well defined set of objects to describe the relationships.
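
To make the idea a little more concrete, here is a rough Python sketch of
one possible JSON notation and of how an engine could derive a run order
from it. The keys "jobs", "cluster", and "depends_on", and the function
execution_order, are invented for this illustration; nothing like this has
been agreed on.

```python
import json

# Hypothetical notation: each job names its cluster and the jobs it waits for.
WORKFLOW_JSON = """
{
  "jobs": {
    "job_a": {"cluster": "cluster_a", "depends_on": []},
    "job_b": {"cluster": "cluster_b", "depends_on": ["job_a"]},
    "job_c": {"cluster": "cluster_a", "depends_on": ["job_b"]}
  }
}
"""


def execution_order(workflow):
    """Return job names in an order that respects depends_on."""
    remaining = {
        name: set(spec["depends_on"])
        for name, spec in workflow["jobs"].items()
    }
    order = []
    while remaining:
        # Jobs with no unmet dependencies can run now.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle or unknown dependency in workflow")
        for name in ready:
            order.append(name)
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


order = execution_order(json.loads(WORKFLOW_JSON))
```

An Oozie engine could translate the same object into a native workflow,
while other engines could walk the order themselves.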

What do you think? Just a rough idea.  Maybe there is a better way.





Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-03-24 Thread Trevor McKay
Weiting, Andrew,

Agreed, great ideas!  As Andrew noted, we have discussed some of these
things before and it would be great to discuss them in Vancouver.

I think that a Sahara-side workflow manager is the right approach. Oozie
has a lot of capability for job coordination, but it won't work for all
of our cluster and job types.

Notes on Spark in particular -- when we implemented Spark EDP, we looked
at various implementations for a Spark job server.  One was to extend
Oozie, one was to use the Ooyala Spark job server, and one was to use
ssh around spark-submit.  We chose the last; notes are here:

https://etherpad.openstack.org/p/sahara_spark_edp

We could potentially revisit the Ooyala job server.  My impression at
the time was that for the functions we wanted, it was pretty heavy. But
if we are going to add job coordination as a general feature, it may be
appropriate. I believe in the Spark community it is the dominant
solution for job management; the source is here:

https://github.com/spark-jobserver/spark-jobserver

As part of the Spark investigation, I posted on this JIRA, too. This is
a JIRA for developing a REST api to the spark job server, which may be
enough for us to build our own coordination system:

https://issues.apache.org/jira/browse/SPARK-3644

Best,

Trevor


Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-03-23 Thread Chen, Weiting
Hi Andrew.

Thanks for response. My reply in line.

From: Andrew Lazarev [mailto:alaza...@mirantis.com]
Sent: Saturday, March 21, 2015 12:10 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

Hi Weiting,

1. Add a schedule feature to run the jobs on time:
This request comes from the customer, they usually run the job in a specific 
time every day. So it should be great if there
 is a scheduler to help arrange the regular job to run.
Looks like a great feature. And should be quite easy to implement. Feel free to 
create spec for that.
[Weiting] We are working on the spec and the bp has already been registered in 
https://blueprints.launchpad.net/sahara/+spec/enable-scheduled-edp-jobs.

2. A more complex workflow design in Sahara EDP:
Current EDP only provide one job that is running on one cluster.
Yes. And ability to run several jobs in one oozie workflow is discussed on 
every summit (e.g. 'coordinated jobs' at 
https://etherpad.openstack.org/p/kilo-summit-sahara-edp). But for now it was 
not a priority

But in a real case, it should be more complex, they usually use multiple jobs 
to calculate the data and may use several different type clusters to process 
it..
It means that workflow manager should be on Sahara side. Looks like a 
complicated feature. But we would be happy to help with designing and 
implementing it. Please file proposal for design session on ongoing summit. Are 
you going to Vancouver?
[Weiting] I’m not sure I will be there because the plan is still not ready yet. 
We are also looking for some customer’s real case in big data area and see how 
they are using data processing in current environment. However, for any idea we 
can update later.

Another concern is about Spark, for Spark it cannot use Oozie to do this. So 
we need to create an abstract layer to help to implement this kind of 
scenarios.
If workflow is on Sahara side it should work automatically for all engines.
[Weiting] Yes, agree.

Thanks,
Andrew.



On Sun, Mar 8, 2015 at 3:17 AM, Chen, Weiting 
weiting.c...@intel.com wrote:
Hi all.

We got several feedbacks about Sahara EDP’s future from some China customers.
Here are some ideas we would like to share with you and need your input if we 
can implement them in Sahara(Liberty).

1. Add a schedule feature to run the jobs on time:
This request comes from the customer, they usually run the job in a specific 
time every day. So it should be great if there is a scheduler to help arrange 
the regular job to run.

2. A more complex workflow design in Sahara EDP:
Current EDP only provide one job that is running on one cluster.
But in a real case, it should be more complex, they usually use multiple jobs 
to calculate the data and may use several different type clusters to process it.
For example: Raw Data -> Job A (Cluster A) -> Job B (Cluster B) ->
Job C (Cluster A) -> Result
Actually in my opinion, this kind of job could be easy to implement by using 
Oozie as a workflow engine. But for current EDP, it doesn’t implement this kind 
of complex case.
Another concern is about Spark, for Spark it cannot use Oozie to do this. So we 
need to create an abstract layer to help to implement this kind of scenarios.

However, any suggestion is welcome.
Thanks.




Re: [openstack-dev] About Sahara EDP New Ideas for Liberty

2015-03-20 Thread Andrew Lazarev
Hi Weiting,

1. Add a schedule feature to run the jobs on time:
This request comes from the customer, they usually run the job in a
specific time every day. So it should be great if there
 is a scheduler to help arrange the regular job to run.
Looks like a great feature. And should be quite easy to implement. Feel
free to create spec for that.

2. A more complex workflow design in Sahara EDP:
Current EDP only provide one job that is running on one cluster.
Yes. And ability to run several jobs in one oozie workflow is discussed on
every summit (e.g. 'coordinated jobs' at
https://etherpad.openstack.org/p/kilo-summit-sahara-edp). But for now it
was not a priority

But in a real case, it should be more complex, they usually use multiple
jobs to calculate the data and may use several different type clusters to
process it..
It means that workflow manager should be on Sahara side. Looks like a
complicated feature. But we would be happy to help with designing and
implementing it. Please file proposal for design session on ongoing summit.
Are you going to Vancouver?

Another concern is about Spark, for Spark it cannot use Oozie to do this.
So we need to create an abstract layer to help to implement this kind of
scenarios.
If workflow is on Sahara side it should work automatically for all engines.

Thanks,
Andrew.



On Sun, Mar 8, 2015 at 3:17 AM, Chen, Weiting weiting.c...@intel.com
wrote:

  Hi all.



 We got several feedbacks about Sahara EDP’s future from some China
 customers.

 Here are some ideas we would like to share with you and need your input if
 we can implement them in Sahara(Liberty).



 1. Add a schedule feature to run the jobs on time:

 This request comes from the customer, they usually run the job in a
 specific time every day. So it should be great if there is a scheduler to
 help arrange the regular job to run.



 2. A more complex workflow design in Sahara EDP:

 Current EDP only provide one job that is running on one cluster.

 But in a real case, it should be more complex, they usually use multiple
 jobs to calculate the data and may use several different type clusters to
 process it.

 For example: Raw Data -> Job A (Cluster A) -> Job B (Cluster B) ->
 Job C (Cluster A) -> Result

 Actually in my opinion, this kind of job could be easy to implement by
 using Oozie as a workflow engine. But for current EDP, it doesn’t implement
 this kind of complex case.

 Another concern is about Spark, for Spark it cannot use Oozie to do this.
 So we need to create an abstract layer to help to implement this kind of
 scenarios.



 However, any suggestion is welcome.

 Thanks.





[openstack-dev] About Sahara EDP New Ideas for Liberty

2015-03-08 Thread Chen, Weiting
Hi all.

We received feedback about the future of Sahara EDP from some customers in China.
Here are some ideas we would like to share; we need your input on whether we 
can implement them in Sahara (Liberty).

1. Add a scheduling feature to run jobs at set times:
This request comes from customers, who usually run a job at a specific 
time every day. It would be great to have a scheduler to help arrange 
these regular jobs.

2. A more complex workflow design in Sahara EDP:
Current EDP only supports a single job running on a single cluster.
Real cases are more complex: customers usually use multiple jobs 
to process the data and may use several different types of clusters to do it.
For example: Raw Data -> Job A (Cluster A) -> Job B (Cluster B) ->
Job C (Cluster A) -> Result
In my opinion, this kind of job could be easy to implement by using 
Oozie as a workflow engine, but current EDP does not implement this kind 
of complex case.
Another concern is Spark, which cannot use Oozie to do this. So we 
need to create an abstract layer to help implement this kind of scenario.
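
For the first idea above (running a job at a specific time every day), the
core calculation a scheduler needs is the next run time. A minimal Python
sketch follows; the function name next_daily_run is hypothetical, and it
ignores time zones and says nothing about how a real engine would trigger
the job.

```python
from datetime import datetime, time, timedelta


def next_daily_run(now, run_at):
    """Return the next datetime at which a daily job should start.

    now is the current naive datetime; run_at is the daily time of day.
    """
    candidate = datetime.combine(now.date(), run_at)
    # If today's slot has already passed (or is right now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

For example, with a job scheduled for 02:00, a call made at 03:17 on
2015-03-08 gives 02:00 on 2015-03-09.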

However, any suggestion is welcome.
Thanks.
