Re: Libtaskotron - allow non-cli data input

2017-02-09 Thread Adam Williamson
On Thu, 2017-02-09 at 00:29 +0100, Josef Skladanka wrote:
> On Wed, Feb 8, 2017 at 8:06 PM, Adam Williamson wrote:
> 
> > Wouldn't it be great if we had a brand new project which would be the
> > ideal place to represent such conventions, so the bit of taskotron
> > which reported the results could construct them conveniently? :P
> 
> 
> https://xkcd.com/684/ :) (I mean no offense, it just really reminded me of that)

Hmm, clearly we need a *** CONVENTION *** for quoting xkcd ;)
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net


Re: Libtaskotron - allow non-cli data input

2017-02-08 Thread Josef Skladanka
On Wed, Feb 8, 2017 at 7:39 PM, Kamil Paral  wrote:

> > I mentioned this in IRC but why not have a bit of both and allow input
> > as either a file or on the CLI. I don't think that json would be too
> > bad to type on the command line as an option for when you're running
> > something manually:
> >
> >   runtask sometask.yml -e "{'namespace':'someuser',\
> > 'module':'somemodule', 'commithash': 'abc123df980'}"
>
> I probably misunderstood you on IRC. In my older response here, I actually
> suggested something like this - having "--datafile data.json", which can
> also be used like "--datafile -" meaning stdin. You can then use "echo
>  | runtask --datafile - ". But your solution is probably
> easier to look at.
>

I honestly like the `--datafile [fname, -]` approach a lot. We could sure
name the param better, but that's about it. I like it better than
necessarily having a long cmdline, and you can still use "echo " if
you wanted to have a cmdline example, or "cat " for the common usage.
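
For illustration, a minimal sketch of how the option handling could look
(names are made up, stdlib argparse/json only - not the actual runtask
code):

    import argparse
    import json
    import sys

    def load_task_data(path):
        # '-' means read the structured data from stdin
        if path == "-":
            return json.load(sys.stdin)
        with open(path) as datafile:
            return json.load(datafile)

    parser = argparse.ArgumentParser(prog="runtask")
    parser.add_argument("formula")
    parser.add_argument("--datafile",
                        help="file with structured task data, '-' for stdin")
    args = parser.parse_args()
    task_data = load_task_data(args.datafile) if args.datafile else {}

That way both `runtask task.yml --datafile data.json` and
`echo '{"commithash": "abc123df980"}' | runtask task.yml --datafile -`
would work.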



> > There would be some risk of running into the same problems we had with
> > AutoQA where depcheck commands were too long for bash to parse but
> > that's when I'd say "you need to use a file for that"
>
> Definitely.
>

And that's why I'd rather stay away from long cmdlines :)


>
> > > I'm a bit torn between providing as much useful data as we can when
> > > scheduling (because a) yaml formulas are very limited and you can't
> > > do stuff like string parsing/splitting b) might save you a lot of
> > > work/code to have this data presented to you right from the start),
> > > and the easy manual execution (when you need to gather and provide
> > > all that data manually). It's probably about finding the right
> > > balance. We can't avoid having structured multi-data input, I don't
> > > think.
> >
> > If we did something along the lines of allowing input on the CLI, we
> > could have both, no? We'd need to be clear on the precedence of file vs
> > CLI input but that seems to me like something that could solve the
> > issue of dealing with more complicated inputs without requiring users
> > to futz with a file when running tasks locally.
>
> That's not the worry I had. Creating a file or writing json to a command
> line is a bit more work than the current state, but not a problem. What I'm
> a bit afraid of is that we'll start adding many keyvals into the json just
> because it is useful or convenient. As an artificial example, let's say for
> a koji_build FOO we supply NVR, name, epoch, owner, build_id and
> build_timestamp. And if we receive all of that in the fedmsg (or from some
> koji query that we'll need to do anyway for some reason), it makes sense to
> pass that data, it's free for us and it's less work for the task (it
> doesn't have to do its own queries). However, running the task manually as
> a task developer (and I don't mean re-running an existing task on FOO by
> copy-pasting the existing data json from a log file, but running it on a
> fresh new koji build BAR) makes it much more difficult for the developer,
> because he needs to figure out (manually) all those values for BAR just to
> be able to run his task.
>

> An even more extreme (deliberately, to illustrate the point) example would be
> to pass the whole koji buildinfo dict structure that you get when running
> koji.getBuild(). Which could be actually easier for the developer to
> emulate, because we could document a single command that retrieves exactly
> that. Unless we start adding additional data to it...
>
> So on one hand, I'd like to pass as much data as we have to make task
> formulas simpler, but on the other hand, I'm afraid task development
> (manual task execution, without having a trigger to get all this data by
> magic) will get harder. (I hope I managed to explain it better this time :))


As I mentioned in one of the other emails - the dev (while developing)
should really only need to provide the data that is relevant for the
task/formula. Why have a ton of stuff that you never use in the "testing
data" - it is unnecessary work, and even makes it more prone to error IMO.
If I had a task that only needs NVR, name and build_timestamp, I'd (while
developing/testing) just pass a structure containing these.

Or do you think that is a bad idea? I sure can see how (e.g.) the
resultsdb directive could be spitting out warnings about missing data, but
that is why we have the different profiles - the resultsdb directive could
fail in production mode if data was missing (and that probably means some
serious error), or just warn you in development mode.
If you wanted to "test it thoroughly", you'd better use some real data
anyway - and if we store the "input data structure" in the task logs, then
there is even a good source of those, should you want to copy-paste it.
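
Roughly this kind of behavior, as a sketch (the function and profile names
are made up, this is not the actual directive code):

    def check_result_data(data, profile, required=("item", "type")):
        # fail hard in production, just warn while developing
        missing = [key for key in required if key not in data]
        if not missing:
            return
        message = "missing result data: %s" % ", ".join(missing)
        if profile == "production":
            raise ValueError(message)
        print("WARNING: %s" % message)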

I hope I understood what you meant.

joza

Re: Libtaskotron - allow non-cli data input

2017-02-08 Thread Josef Skladanka
On Wed, Feb 8, 2017 at 4:11 PM, Tim Flink  wrote:

> On Wed, 8 Feb 2017 08:26:30 -0500 (EST)
> Kamil Paral  wrote:
>
> I think another question is whether we want to keep assuming that the
> *user supplies the item* that is used as a UID in resultsdb. As you say,
> it seems a bit odd to require people to munge stuff together like
> "namespace/module#commithash" at the same time that it can be separated
> out into a dict-like data structure for easy access.
>
>
Emphasis mine. I think that we should not really be assuming that at all.
In most cases, the item should be provided by the trigger automagically,
the same with the type. With what I'd like to see for the structured input,
the conventions module could/should take that data into account while
constructing the "default" results.
Keep in mind that one result can also have multiple "items" (as it can
have multiples of any extra-data field), if it makes sense. One would be
the "auto-provided" item, and the second could be user-added. That would
make it both consistent (the trigger-generated item) and flexible, if a
different "item" makes sense.
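
As an illustrative sketch (values made up), the reported extra data could
then look like:

    result_extradata = {
        # the trigger-generated item, guaranteed by convention,
        # plus a user-added one where a different "item" makes sense
        "item": ["someuser/somemodule#abc123df980", "somemodule"],
        "type": "module",
    }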

> Would it make more sense to just pass in the dict and have semi-coded
> conventions for reporting to resultsdb based on the item_type which
> could be set during the task instead of requiring that to be known
> before task execution time?
>
> Something along the lines of enabling some common kinds of input for
> the resultsdb directive - module commit, dist-git rpm change, etc. so
> that you could specify the item_type to the resultsdb directive and it
> would know to look for certain bits to construct the UID item that's
> reported to resultsdb.
>

Yup, I think that setting some conventions, and making sure we keep the
same (or at least very similar) set of metadata for the relevant type, is
key.
I mentioned this in the previous email, but in the past few days I have
been thinking about making the types a bit more general - the pretty
specific types we have now made sense when we first designed stuff and had
a very narrow usecase.
Now that we want to make the stack usable in stuff like Platform CI, I
think it would make sense to abstract a bit more, so we don't have
`koji_build`, `brew_build`, `copr_build`, which are essentially the same
but differ in minor details. We can specify those classes/details in
extradata, or could even use multiple types - having the common set of
information guaranteed for the whole 'build' type, and adding other kinds
of data to `koji_build`, `brew_build` or `whatever_build` as needed.
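
In data terms, the difference would be roughly this (a sketch, values made
up, nothing settled):

    # current, source-specific types
    koji_result = {"item": "foo-1.0-1.fc25", "type": "koji_build"}
    brew_result = {"item": "bar-2.0-1.el7", "type": "brew_build"}

    # a more general type, with the source moved to extra data
    build_result = {"item": "foo-1.0-1.fc25", "type": "build",
                    "source": "koji"}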


> Using Kamil's example, assume that we have a task for a module and the
> following data is passed in:
>
>   {'namespace':'someuser', 'module':'httpd', 'commithash':'abc123df980'}
>
> Neither item nor type is specified on the CLI at execution time. The
> task executes using that input data and when it comes time to report to
> resultsdb:
>
>   - name: report results to resultsdb
>     resultsdb:
>       results: ${some_task_output}
>       type: module
>
> By passing in that type of module, the directive would look through the
> input data and construct the "item" from input.namespace, input.module
> and input.commithash.
>
> I'm not sure if it makes more sense to have a set of "types" that the
> resultsdb directive understands natively or to actually require item
> but allow variable names in it along the lines of
>
>   "item":"${namespace}/${module}#${commithash}"
>

I'd rather have that in "conventions" than in the resultsdb directive, but
I guess it is essentially the same thing, once you think about it.
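
The construction itself would be trivial either way - roughly this, as a
sketch using stdlib string.Template (the real conventions/directive code
would of course differ):

    from string import Template

    input_data = {"namespace": "someuser", "module": "httpd",
                  "commithash": "abc123df980"}

    item = Template("${namespace}/${module}#${commithash}").substitute(input_data)
    print(item)  # someuser/httpd#abc123df980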


>
> > > My take on this is, that we will say which variables are provided
> > > by the trigger for each type. If a variable is missing, the
> > > formula/execution should just crash when it tries to access it.
> >
> > Sounds reasonable.
>
> +1 from me as well. Assume everything is there, crash if there's
> something requested that isn't available (missing data etc.)
>
>
yup, that's what I have in mind.


> > We'll probably end up having a mix of necessary and convenience
> > values in the inputdata. "name" is probably a convenience value here,
> > so that tasks don't have to parse if they need to use it in a certain
> > directive. "epoch" might be an important value for some test cases,
> > and let's say we learn the value in trigger during scheduling
> > investigation, so we decide to pass it down. But that information is
> > not that easy to get manually. If you know what to do, you'll open up
> > a particular koji page and see it. But you can also be clueless about
> > how to figure it out. The same goes for build_id, again can be
> > important, but also can be retrieved later, so more of a convenience
> > data (saving you from writing a koji query). This is just an example
> > for illustration, might not match real-world use cases.
>
> I mentioned this in IRC but why not have a bit of both and allow input
> as either a file or on the CLI. I don't think that json would be 

Re: Libtaskotron - allow non-cli data input

2017-02-08 Thread Josef Skladanka
On Wed, Feb 8, 2017 at 2:26 PM, Kamil Paral  wrote:

> This is what I meant - keeping item as is, but being able to pass another
> structure to the formula, which can then be used from it. I'd still like to
> keep the item to a single string, so it can be queried easily in the
> resultsdb. The item should still represent what was tested. It's just that
> I want to be able to pass arbitrary data to the formulae, without the need
> for ugly hacks like we have seen with the git commits lately.
>
>
> So, the question is now how much we want the `item` to uniquely identify
> the item under test. Currently we mostly do (rpmlint, rpmgrill) and
> sometimes don't (depcheck, because item is NVR, but the full ID is NEVRA,
> and we store arch in the results extradata section).
>
>
I still kind of believe that the `item` should be chosen with great respect
to what actually is the item under test, but it also really depends on what
you want to do with it later on. Note that the `item` is actually a
convention (yay, more water to adamw's "if we only had some awesome new
project" mill), and is not enforced in any way. I believe that there should
be firm rules (once again - conventions) on what the item is for each "well
known" item type, so you can kind-of assume that if you query for
`item=foo&type=koji_build` you are getting the results related to that
build.
As we were discussing privately about the item types (I'm not going to go
into much detail here, but for the rest of you guys - I'm contemplating
making the types more general, and using more of the 'metadata' to store
additional specifics - like replacing `type=koji_build` with `type=build,
source=koji`, or `type=build, source=brew` - on the high level, you know
that a package/build was tested, and you don't really care where it came
from, but you sometimes might care, and so there is the additional metadata
stored. We could even have more types stored for one result, or I don't
know... It's complicated), the idea behind item is that it should be a
reasonable value that carries the "what was tested" information, and you
will use the other "extra-data" fields to provide more details (like we
kind-of want to do with arch, but we don't really..). The reason for it to
be a "reasonable value" and not a "superset of all values that we have" is
to make the general querying a bit more straightforward.
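
For illustration, the kind of consumer-side lookup I mean is just this
(a sketch - it assumes the resultsdb 2.0 API response shape and the
requests library, and the URL is illustrative):

    import requests

    RESULTSDB = "https://taskotron.fedoraproject.org/resultsdb_api/api/v2.0"

    response = requests.get(RESULTSDB + "/results",
                            params={"item": "foo-1.0-1.fc25",
                                    "type": "koji_build"})
    for result in response.json()["data"]:
        print(result["outcome"], result["testcase"]["name"])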


> If we have structured input data, what happens to `item` for
> check_modulemd? Currently it is "namespace/module#commithash". Will it stay
> the same, and they'll just avoid parsing it because we'll also provide
> ${data.namespace}, ${data.module} and ${data.hash}? Or will the `item` be
> perhaps just "module" (and the rest will be stored as extradata)? What
> happens when we have a generic git_commit type, and the source can be an
> arbitrary service? Will we have some convention to use item as
> "giturl#commithash"?
>
>
Once again - whatever makes sense as the item. For me that would be the
Repo/SHA combo, with server, repo, branch, and commit in extradata.
And it comes down to "storing as much relevant metadata as possible" once
again. The thing is that as long as stuff is predictable, it almost does
not matter what it is, and it once again points out how good of an idea
the conventions stuff is. I believe that we are now storing much less
metadata in resultsdb than we should, and it is caused mostly by the fact
that
 - we did not really need to use the results much so far
 - it is pretty hard to pass data into libtaskotron, and querying all the
services all the time to get the metadata is/was deemed a bad idea - why
do it ourselves, if the consumers can get it themselves? They know that it
is koji_build, so they can query koji.
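
Concretely, for the Repo/SHA combo above, I'd imagine a result shaped
roughly like this (values are made up):

    git_commit_result = {
        "item": "someuser/somemodule#abc123df980",  # the repo/SHA combo
        "type": "git_commit",
        # the details, stored as extra data
        "server": "https://pagure.io",
        "repo": "someuser/somemodule",
        "branch": "master",
        "commit": "abc123df980",
    }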

There is a fine balance to be struck, IMO, so we don't end up storing "all
the data" in resultsdb. But I believe that the stuff relevant for the
result consumption should be there.


> Because the ideal way would be to store the whole item+data structure as
> item in resultsdb. But that's hard to query for humans, so we want a simple
> string as an identifier.
>

This, for me, is once again about being predictable. As I said above, I
still think that `item` should be a reasonable identifier, but not
necessarily a superset of all the info. That is what the extra data is
for. Speaking of which...


> But sometimes there can be a lot of data points which uniquely identify
> the thing under test only when you specify it all (for example what Dan
> wrote, sometimes the ID is the old NVR *plus* the new NVR). Will we want to
> somehow combine them into a single item value? We should give some
> directions how people should construct their items.
>
>
My gut feeling here would be storing the "new NVR" (the thing that actually
caused the test to be executed) as item, and adding 'old nvr' to extra
data. But I'm not that familiar with the specific usecase. To me, this
would make sense, because when you query for "this NVR related results"
you'd get the results 

Re: Libtaskotron - allow non-cli data input

2017-02-08 Thread Adam Williamson
On Wed, 2017-02-08 at 08:11 -0700, Tim Flink wrote:
> Would it make more sense to just pass in the dict and have semi-coded
> conventions for reporting to resultsdb based on the item_type which
> could be set during the task instead of requiring that to be known
> before task execution time?

Wouldn't it be great if we had a brand new project which would be the
ideal place to represent such conventions, so the bit of taskotron
which reported the results could construct them conveniently? :P
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
http://www.happyassassin.net


Libtaskotron - allow non-cli data input

2017-02-06 Thread Josef Skladanka
Chaps,

we have discussed this many times in the past, and as with the
type-restriction, I think now is the right time to actually get this done.

It sure ties into the fact that I'm trying to put
Taskotron-continuously-testing-Taskotron together - the idea here being
that on each commit to a devel branch of any of the Taskotron components,
we will spin up a testing instance of the whole stack and run some
integration tests.

To do this, I added a new consumer to Trigger
(https://phab.qa.fedoraproject.org/D1110) that eats Pagure.io commits and
spins up jobs based on that.
This means that I want to have the repo, branch, and commit id as input
for the job, thus resorting to yet-another-nasty-hack to pass the combined
data into the job (https://phab.qa.fedoraproject.org/D1110#C16697NL18) so
I can hack it apart later on, either in the formula or in the task itself.

It would be very helpful to be able to pass some structured data into the
task instead.

I kind of remember that we agreed on json/yaml. The possibilities were
either reading it from stdin or from a file. I don't really care that much
either way, but would probably feel a bit better about having a cli-param
to pass the filename there.

The formulas already provide a way to 'query' structured data via the
dot-format, so we could get by with just passing some variable like
'task_data' that would contain the parsed json/yaml.
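
As a sketch of what I mean (the file name, the keys, and the 'task_data'
variable are all just examples):

    import json

    # data.json, as passed in via the proposed cli param:
    #   {"repo": "https://pagure.io/taskotron/libtaskotron",
    #    "branch": "develop", "commithash": "abc123df980"}
    with open("data.json") as datafile:
        task_data = json.load(datafile)

    # a formula's dot-format query like ${task_data.commithash}
    # would then resolve to:
    print(task_data["commithash"])  # abc123df980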

What do you think?

Joza