Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings
Thank you for bringing this up, Trevor. EDP is getting more diverse and it's time to change its model. I totally agree with your proposal, with one minor comment: instead of the savanna. prefix in job_configs, wouldn't it be better to use edp.? I think savanna. is too broad a word for this.

One more bureaucratic thing: I see you have already started implementing it [1], and it is named and tracked as the new EDP workflow [2]. I think a new blueprint should be created for this feature to track all code changes as well as docs updates. By docs I mean the public Savanna docs about EDP, the REST API docs and samples.

[1] https://review.openstack.org/#/c/69712
[2] https://blueprints.launchpad.net/openstack/?searchtext=edp-oozie-streaming-mapreduce

Regards,
Alexander Ignatov

On 28 Jan 2014, at 20:47, Trevor McKay tmc...@redhat.com wrote:

> What if we allow savanna. settings to be added to configs? If we do
> that, any and all special configuration settings for specific job types
> or subtypes can be handled with no database changes and no api changes.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
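The proposal and the edp. prefix suggestion can be illustrated with a minimal sketch. This is not Savanna's actual code: the function name split_edp_configs and the key edp.streaming.mapper are hypothetical, chosen only to show how prefixed settings might be stripped out before the remaining configs are rendered into the Oozie workflow.

```python
EDP_PREFIX = "edp."  # prefix suggested in this thread; the handling below is an assumption


def split_edp_configs(configs):
    """Separate edp.* settings from the settings passed through to Oozie/Hadoop."""
    edp_settings = {}
    workflow_settings = {}
    for key, value in configs.items():
        if key.startswith(EDP_PREFIX):
            edp_settings[key] = value
        else:
            workflow_settings[key] = value
    return edp_settings, workflow_settings


job_configs = {
    "configs": {
        "mapred.reduce.tasks": "2",
        "edp.streaming.mapper": "/bin/cat",  # hypothetical key name
    },
    "params": {},
    "args": [],
}

# Only workflow_settings would be rendered into the generated workflow.
edp_settings, workflow_settings = split_edp_configs(job_configs["configs"])
```

Because the split works on the existing 'configs' dictionary, the stored job_configs structure keeps the same three top-level fields, which is the point of the proposal.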
Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings
So, assuming we go forward with this, the follow-up question is whether or not to move main_class and java_opts for Java actions into edp.java.main_class and edp.java.java_opts configs. I think yes.

Best,
Trevor

On Wed, 2014-01-29 at 09:15 -0500, Trevor McKay wrote:
> On Wed, 2014-01-29 at 14:35 +0400, Alexander Ignatov wrote:
> > Thank you for bringing this up, Trevor. EDP is getting more diverse
> > and it's time to change its model. I totally agree with your
> > proposal, with one minor comment: instead of the savanna. prefix in
> > job_configs, wouldn't it be better to use edp.? I think savanna. is
> > too broad a word for this.
>
> +1, brilliant. EDP is perfect. I was worried about the scope of
> savanna. too.
>
> > One more bureaucratic thing: I see you have already started
> > implementing it [1], and it is named and tracked as the new EDP
> > workflow [2]. I think a new blueprint should be created for this
> > feature to track all code changes as well as docs updates. By docs I
> > mean the public Savanna docs about EDP, the REST API docs and
> > samples.
>
> Absolutely, I can make it a new blueprint. Thanks.
>
> > [1] https://review.openstack.org/#/c/69712
> > [2] https://blueprints.launchpad.net/openstack/?searchtext=edp-oozie-streaming-mapreduce
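As a rough sketch of the follow-up question, the move of main_class and java_opts into configs could look like the code below. The helper name fold_java_fields and the sample class name are hypothetical; only the key names edp.java.main_class and edp.java.java_opts come from the thread, and nothing here reflects how Savanna actually performs the change.

```python
def fold_java_fields(job_configs):
    """Return a copy with top-level Java fields moved under edp.java.* config keys."""
    folded = {
        "configs": dict(job_configs.get("configs", {})),
        "params": dict(job_configs.get("params", {})),
        "args": list(job_configs.get("args", [])),
    }
    for field in ("main_class", "java_opts"):
        if field in job_configs:
            # e.g. 'main_class' becomes 'edp.java.main_class'
            folded["configs"]["edp.java." + field] = job_configs[field]
    return folded


old_style = {
    "configs": {},
    "args": ["input", "output"],
    "main_class": "org.example.WordCount",  # hypothetical class name
    "java_opts": "-Xmx512m",
}
new_style = fold_java_fields(old_style)
```

The result again has only 'configs', 'params', and 'args' at the top level, so Java jobs would stop being a special case in the stored structure.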
Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings
I imagine 'neutron' would follow suit as well.

On Jan 29, 2014, at 9:23 AM, Trevor McKay tmc...@redhat.com wrote:

> So, assuming we go forward with this, the followup question is whether
> or not to move main_class and java_opts for Java actions into
> edp.java.main_class and edp.java.java_opts configs. I think yes.
>
> Best,
> Trevor
Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings
Trevor, it sounds reasonable to move main_class and java_opts to edp.java.

Jon, do you mean neutron-related info for namespaces support? If yes, then neutron isn't a user-side config.

Thanks.

On Wed, Jan 29, 2014 at 6:37 PM, Jon Maron jma...@hortonworks.com wrote:

> I imagine 'neutron' would follow suit as well..
Re: [openstack-dev] [savanna] How to handle diverging EDP job configuration settings
I like the idea of the edp. prefix.

Andrew.

On Wed, Jan 29, 2014 at 6:23 AM, Trevor McKay tmc...@redhat.com wrote:

> So, assuming we go forward with this, the followup question is whether
> or not to move main_class and java_opts for Java actions into
> edp.java.main_class and edp.java.java_opts configs. I think yes.
>
> Best,
> Trevor
[openstack-dev] [savanna] How to handle diverging EDP job configuration settings
Hello all,

In our first pass at EDP, the model for job settings was very consistent across all of our job types. The execution-time settings fit into this (superset) structure:

    job_configs = {'configs': {},  # config settings for oozie and hadoop
                   'params': {},   # substitution values for Pig/Hive
                   'args': []}     # script args (Pig and Java actions)

But we have some things that don't fit (and probably more in the future):

1) Java jobs have 'main_class' and 'java_opts' settings

   Currently these are handled as additional fields added to the structure above. These were the first to diverge.

2) Streaming MapReduce (anticipated) requires mapper and reducer settings (different than the mapred..class settings for non-streaming MapReduce)

Problems caused by adding fields
--------------------------------

The job_configs structure above is stored in the database. Each time we add a field to the structure above at the level of configs, params, and args, we force a change to the database tables, a migration script and a change to the JSON validation for the REST api.

We also cause a change for python-savannaclient and potentially other clients.

This kind of change seems bad.

Proposal: Borrow a page from Oozie and add savanna. configs
-----------------------------------------------------------

I would like to fit divergent job settings into the structure we already have. One way to do this is to leverage the 'configs' dictionary. This dictionary primarily contains settings for hadoop, but there are a number of oozie.xxx settings that are passed to oozie as configs or set by oozie for the benefit of running apps.

What if we allow savanna. settings to be added to configs? If we do that, any and all special configuration settings for specific job types or subtypes can be handled with no database changes and no api changes.

Downside
--------

Currently, all 'configs' are rendered in the generated oozie workflow. The savanna. settings would be stripped out and processed by Savanna, thereby changing that behavior a bit (maybe not a big deal).

We would also be mixing savanna. configs with config_hints for jobs, so users would potentially see savanna. settings mixed with oozie and hadoop settings. Again, maybe not a big deal, but it might blur the lines a little bit. Personally, I'm okay with this.

Slightly different
------------------

We could also add a 'savanna-configs': {} element to job_configs to keep the configuration spaces separate.

But, now we would have 'savanna-configs' (or another name), 'configs', 'params', and 'args'. Really? Just how many different types of values can we come up with? :)

I lean away from this approach.

Related: breaking up the superset
---------------------------------

It is also the case that not every job type has every value type.

               Configs  Params  Args
    Hive       Y        Y       N
    Pig        Y        Y       Y
    MapReduce  Y        N       N
    Java       Y        N       Y

So do we make that explicit in the docs and enforce it in the api with errors?

Thoughts? I'm sure there are some :)

Best,
Trevor
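If the job-type table above were enforced in the api, one possible shape for the check is sketched here. The table contents come from the message; the names ALLOWED_SECTIONS and unsupported_sections are hypothetical, and the real validation would presumably live in the JSON validation layer rather than in a standalone function like this.

```python
# Which job_configs sections each job type accepts, per the table in this thread.
ALLOWED_SECTIONS = {
    "Hive": {"configs", "params"},
    "Pig": {"configs", "params", "args"},
    "MapReduce": {"configs"},
    "Java": {"configs", "args"},
}


def unsupported_sections(job_type, job_configs):
    """Return the non-empty sections that the given job type does not support."""
    allowed = ALLOWED_SECTIONS[job_type]
    return sorted(
        name for name, value in job_configs.items()
        if value and name not in allowed
    )


submitted = {
    "configs": {"mapred.reduce.tasks": "2"},
    "params": {"INPUT": "/data/in"},
    "args": ["extra"],
}
mapreduce_errors = unsupported_sections("MapReduce", submitted)
pig_errors = unsupported_sections("Pig", submitted)
```

Empty sections are ignored in this sketch, so a client that always sends all three keys would not be rejected as long as the unsupported ones are empty.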