Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-29 Thread Alexander Ignatov
Thank you for bringing this up, Trevor.

EDP is getting more diverse and it's time to change its model.
I totally agree with your proposal, but one minor comment.
Instead of a savanna. prefix in job_configs, wouldn't it be better to make it
edp.? I think savanna. is too broad a term for this.

And one more bureaucratic thing... I see you already started implementing it
[1], and it is named and goes as a new EDP workflow [2]. I think a new
blueprint should be created for this feature to track all code changes as
well as docs updates. By docs I mean the public Savanna docs about EDP,
REST API docs and samples.

[1] https://review.openstack.org/#/c/69712
[2] https://blueprints.launchpad.net/openstack/?searchtext=edp-oozie-streaming-mapreduce

Regards,
Alexander Ignatov



On 28 Jan 2014, at 20:47, Trevor McKay tmc...@redhat.com wrote:

 [...]


Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-29 Thread Trevor McKay
So, assuming we go forward with this, the follow-up question is whether
or not to move main_class and java_opts for Java actions into
edp.java.main_class and edp.java.java_opts configs.

I think yes.
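
For illustration, a hypothetical Java action before and after the move
(class name and option values are invented):

# today: extra top-level fields on job_configs
job_configs = {'configs': {'mapred.reduce.tasks': '1'},
               'args': ['input', 'output'],
               'main_class': 'org.example.WordCount',
               'java_opts': '-Xmx512m'}

# proposed: folded into 'configs' under the edp. prefix
job_configs = {'configs': {'mapred.reduce.tasks': '1',
                           'edp.java.main_class': 'org.example.WordCount',
                           'edp.java.java_opts': '-Xmx512m'},
               'args': ['input', 'output']}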

Best,

Trevor

On Wed, 2014-01-29 at 09:15 -0500, Trevor McKay wrote:
 On Wed, 2014-01-29 at 14:35 +0400, Alexander Ignatov wrote:
  Thank you for bringing this up, Trevor.
  
  EDP is getting more diverse and it's time to change its model.
  I totally agree with your proposal, but one minor comment.
  Instead of a savanna. prefix in job_configs, wouldn't it be better to
  make it edp.? I think savanna. is too broad a term for this.
 
 +1, brilliant. EDP is perfect.  I was worried about the scope of
 savanna. too.
 
  And one more bureaucratic thing... I see you already started implementing
  it [1], and it is named and goes as a new EDP workflow [2]. I think a new
  blueprint should be created for this feature to track all code changes as
  well as docs updates. By docs I mean the public Savanna docs about EDP,
  REST API docs and samples.
 
 Absolutely, I can make it a new blueprint.  Thanks.
 
  [1] https://review.openstack.org/#/c/69712
  [2] https://blueprints.launchpad.net/openstack/?searchtext=edp-oozie-streaming-mapreduce
  
  Regards,
  Alexander Ignatov
  
  
  
  On 28 Jan 2014, at 20:47, Trevor McKay tmc...@redhat.com wrote:
  
   [...]

Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-29 Thread Jon Maron
I imagine 'neutron' would follow suit as well.

On Jan 29, 2014, at 9:23 AM, Trevor McKay tmc...@redhat.com wrote:

 So, assuming we go forward with this, the follow-up question is whether
 or not to move main_class and java_opts for Java actions into
 edp.java.main_class and edp.java.java_opts configs.
 
 I think yes.
 
 [...]

Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-29 Thread Sergey Lukjanov
Trevor,

it sounds reasonable to move main_class and java_opts to edp.java.

Jon,

do you mean neutron-related info for namespaces support? If yes, then
neutron isn't a user-side config.

Thanks.


On Wed, Jan 29, 2014 at 6:37 PM, Jon Maron jma...@hortonworks.com wrote:

 I imagine 'neutron' would follow suit as well.

 [...]

Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-29 Thread Andrew Lazarev
I like the idea of the edp. prefix.

Andrew.


On Wed, Jan 29, 2014 at 6:23 AM, Trevor McKay tmc...@redhat.com wrote:

 [...]

[openstack-dev] [savanna] How to handle diverging EDP job configuration settings

2014-01-28 Thread Trevor McKay
Hello all,

In our first pass at EDP, the model for job settings was very consistent
across all of our job types. The execution-time settings fit into this
(superset) structure:

job_configs = {'configs': {},  # config settings for oozie and hadoop
               'params': {},   # substitution values for Pig/Hive
               'args': []}     # script args (Pig and Java actions)
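
For example, a hypothetical Pig job execution might be submitted as (all
values here are invented for illustration):

job_configs = {'configs': {'mapred.reduce.tasks': '2'},
               'params': {'INPUT': 'swift://container/input.txt'},
               'args': ['-verbose']}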

But we have some things that don't fit (and probably more in the
future):

1) Java jobs have 'main_class' and 'java_opts' settings
   Currently these are handled as additional fields added to the
structure above.  These were the first to diverge.

2) Streaming MapReduce (anticipated) requires mapper and reducer
settings (different from the mapred.mapper.class/mapred.reducer.class
settings for non-streaming MapReduce)

Problems caused by adding fields
--------------------------------
The job_configs structure above is stored in the database. Each time we
add a field to the structure above at the level of configs, params, and
args, we force a change to the database tables, a migration script and a
change to the JSON validation for the REST api.

We also cause a change for python-savannaclient and potentially other
clients.

This kind of change seems bad.

Proposal: Borrow a page from Oozie and add savanna. configs
-----------------------------------------------------------
I would like to fit divergent job settings into the structure we already
have.  One way to do this is to leverage the 'configs' dictionary.  This
dictionary primarily contains settings for hadoop, but there are a
number of oozie.xxx settings that are passed to oozie as configs or
set by oozie for the benefit of running apps.

What if we allow savanna. settings to be added to configs?  If we do
that, any and all special configuration settings for specific job types
or subtypes can be handled with no database changes and no api changes.
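
For instance, a streaming MapReduce job could then be submitted as (the
savanna.-prefixed key names below are hypothetical, just to show the shape):

job_configs = {'configs': {'mapred.reduce.tasks': '2',
                           'savanna.mapreduce.streaming.mapper': '/bin/cat',
                           'savanna.mapreduce.streaming.reducer': '/usr/bin/wc'},
               'params': {},
               'args': []}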

Downside
--------
Currently, all 'configs' are rendered in the generated oozie workflow.
The savanna. settings would be stripped out and processed by Savanna,
thereby changing that behavior a bit (maybe not a big deal).
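
A minimal sketch of that stripping step (function name and details are
assumed, not actual Savanna code):

def split_configs(configs, prefix='savanna.'):
    # settings handled by Savanna itself vs. settings rendered into
    # the generated oozie workflow
    ours = {k: v for k, v in configs.items() if k.startswith(prefix)}
    workflow = {k: v for k, v in configs.items()
                if not k.startswith(prefix)}
    return ours, workflow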

We would also be mixing savanna. configs with config_hints for jobs,
so users would potentially see savanna. settings mixed with oozie
and hadoop settings.  Again, maybe not a big deal, but it might blur the
lines a little bit.  Personally, I'm okay with this.

Slightly different
------------------
We could also add a 'savanna-configs': {} element to job_configs to
keep the configuration spaces separate.

But, now we would have 'savanna-configs' (or another name), 'configs',
'params', and 'args'.  Really? Just how many different types of values
can we come up with? :)

I lean away from this approach.

Related: breaking up the superset
---------------------------------

It is also the case that not every job type has every value type.

           Configs   Params   Args
Hive          Y        Y       N
Pig           Y        Y       Y
MapReduce     Y        N       N
Java          Y        N       Y

So do we make that explicit in the docs and enforce it in the api with
errors?
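
If we did enforce it, the check might be as simple as (a sketch; the
mapping and names are assumed for illustration):

ALLOWED = {'Hive':      {'configs', 'params'},
           'Pig':       {'configs', 'params', 'args'},
           'MapReduce': {'configs'},
           'Java':      {'configs', 'args'}}

def check_job_configs(job_type, job_configs):
    bad = [k for k, v in job_configs.items()
           if v and k not in ALLOWED[job_type]]
    if bad:
        raise ValueError('%s jobs do not support: %s'
                         % (job_type, ', '.join(bad)))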

Thoughts? I'm sure there are some :)

Best,

Trevor



  

