Re: [DISCUSS] Migrate from Protractor to Cypress
While the Cypress team suggests taking advantage of stubs where you can, especially during development, we would definitely be able to test real endpoints [1]. Cypress can be used exactly as Protractor is used, but with the many benefits and features it provides [2]. Cypress also offers tools for unit testing [3], which I think may be causing confusion as to what exactly the library does. Cypress' main focus is e2e tests, but because of its architecture, it can be used for all types of tests.

I agree with everything you mentioned, Mike. I think our approach now is fine. In the future it's worth considering the Cypress team's suggestions for when and when not to stub, though there are no hard and fast rules [4][5].

I currently have a branch available on my fork where I've migrated some e2e tests from Protractor to Cypress. With the exception of a little code cleanup, these tests perform the same steps as they do with Protractor. I have yet to include instructions in the README or add an npm script, but if anyone wants to see it in action they can do the following:

- download this branch: https://github.com/sardell/metron/tree/METRON-1648
- run `npm ci` from metron-alerts
- start the e2e test server
- run `./node_modules/.bin/cypress open`
- start a single test by clicking on a file name in the Cypress user interface, or run them all by clicking the play button

I'll try to send some benchmarks when I get a chance, to show the speed difference between the two libraries.

[1] https://docs.cypress.io/api/commands/request.html
[2] https://www.cypress.io/features/
[3] https://docs.cypress.io/guides/guides/stubs-spies-and-clocks.html
[4] https://docs.cypress.io/guides/guides/network-requests.html#Testing-Strategies
[5] https://docs.cypress.io/guides/getting-started/testing-your-app.html#Stubbing-the-Server

On Thu, Sep 20, 2018 at 12:09 AM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> Shane,
>
> Can you elaborate on the testing model you're proposing? I looked through the overview and some of the documentation. As far as I can tell, this would effectively be an e2e test for the UI *only*, so we would be missing testing the actual integration points with the REST API or any other potential endpoints.
>
>    1. Are you proposing we migrate all existing e2e tests, including those that currently hit Elasticsearch?
>    2. Would shifting to Cypress mean that all e2e tests would be isolated to only what is rendered via the browser? i.e. our e2e suite is no longer testing integration to a backend?
>
> My assumption with the term e2e testing is that you are testing an entire vertical slice with no substantive mock/stub/fake/spy/dummy [1] in the way except for maybe some strategic cross-cutting concerns. It sounds like Cypress does NOT mean full e2e. My initial reaction to this is that there's a place for both forms of testing. If Cypress would help UI developers work on incremental changes, similar to how unit tests via JUnit help Java developers iterate on features, then I think that's great. I'm all for that! But unit tests are only one form of testing - we also do integration testing, which can flex multiple classes/components together, as well as more broad stack integration/functional testing that verifies everything works when integrated together. Generally speaking, total # of unit tests > # of integration tests > # functional/acceptance tests. I think we should carve out and define a testing approach for each. Can you elaborate a bit on your vision for how to manage the test interactions, or lack thereof, with the REST API as an integration endpoint? [2]
>
> At the time the write-up James shared was written, it appears that Cypress was not yet open source. Now, it's MIT license - https://github.com/cypress-io/cypress/blob/develop/LICENSE.md.
>
> Mike
>
> 1. https://martinfowler.com/articles/mocksArentStubs.html#TheDifferenceBetweenMocksAndStubs
> 2. https://martinfowler.com/articles/practical-test-pyramid.html#UiTests
>
> On Wed, Sep 19, 2018 at 8:47 AM James Sirota wrote:
>
> > This article comparing the two is not favorable for Cypress. Are any of these concerns relevant to us? If not, then I think Cypress is fine
> >
> > https://hackernoon.com/cypress-io-vs-protractor-e2e-testing-battle-d124ece91dc7
Re: [DISCUSS] Batch Profiler Feature Branch
I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there? This feels functionally very similar to our flat file loaders, which all have inputs for config from the CLI only. On the other hand, our flat file loaders are not typically seeding an existing structure. My concern of a local file profiler config stems from this stated goal:

> The goal would be to enable “profile seeding” which allows profiles to be populated from a time before the profile was created.

So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?

On Thu, Sep 20, 2018 at 10:06 AM Justin Leet wrote:

> The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?
>
> On Thu, Sep 20, 2018 at 9:59 AM Nick Allen wrote:
>
> > Here is another limitation that I just thought. It can only read a profile definition from a file. It probably also makes sense to add an option that allows it to read the current Profiler configuration from Zookeeper.
> >
> > > Is it worth setting up a default config that pulls from the main indexing output?
> >
> > Yes, I think that makes sense. We want the Batch Profiler to point to the right HDFS URL, no matter where/how Metron is deployed. When Metron gets spun-up on a cluster, I should be able to just run the Batch Profiler without having to fuss with the input path.
> >
> > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet wrote:
> >
> > > Re:
> > >
> > > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> > > > > > > > > > Is it worth setting up a default config that pulls from the main > indexing > > > output? I'm a little on the fence about it, but it seems like making > the > > > most common case more or less built-in would be nice. > > > > > > Having said that, I do not consider that a requirement for merging the > > > feature branch. > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota > > wrote: > > > > > > > I think what you have outlined above is a good initial stab at the > > > > feature. Manual install of spark is not a big deal. Configuring via > > > > command line while we mature this feature is ok as well. Doesn't > look > > > like > > > > configuration steps are too hard. I think you should merge. > > > > > > > > James > > > > > > > > 19.09.2018, 08:15, "Nick Allen" : > > > > > I would like to open a discussion to get the Batch Profiler feature > > > > branch > > > > > merged into master as part of METRON-1699 [1] Create Batch > Profiler. > > > All > > > > > of the work that I had in mind for our first draft of the Batch > > > Profiler > > > > > has been completed. Please take a look through what I have and let > me > > > > know > > > > > if there are other features that you think are required *before* we > > > > merge. > > > > > > > > > > Previous list discussions on this topic include [2] and [3]. > > > > > > > > > > (Q) What can I do with the feature branch? > > > > > > > > > > * With the Batch Profiler, you can backfill/seed profiles using > > > > archived > > > > > telemetry. This enables the following types of use cases. > > > > > > > > > > 1. As a Security Data Scientist, I want to understand the > > > > historical > > > > > behaviors and trends of a profile that I have created so that I can > > > > > determine if I have created a feature set that has predictive value > > for > > > > > model building. > > > > > > > > > > 2. 
As a Security Data Scientist, I want to understand the > > > > historical > > > > > behaviors and trends of a profile that I have created so that I can > > > > > determine if I have defined the profile correctly and created a > > feature > > > > set > > > > > that matches reality. > > > > > > > > > > 3. As a Security Platform Engineer, I want to generate a > > profile > > > > > using archived telemetry when I deploy a new model to production so > > > that > > > > > models depending on that profile can function on day 1. > > > > > > > > > > * METRON-1699 [1] includes a more detailed description of the > > > feature. > > > > > > > > > > (Q) What work was completed? > > > > > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to > > > > remain > > > > > consistent with our current Java-heavy code base. > > > > > > > > > > * The Batch Profiler is executed from the command-line. It can be > > > > > launched using a script or by calling `spark-submit`, which may be > > > useful > > > > > for advanced users. > > > > > > > > > > * Input telemetry can be consumed from multiple sources; for > > example > > >
Re: [DISCUSS] Batch Profiler Feature Branch
The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?

On Thu, Sep 20, 2018 at 9:59 AM Nick Allen wrote:

> Here is another limitation that I just thought. It can only read a profile definition from a file. It probably also makes sense to add an option that allows it to read the current Profiler configuration from Zookeeper.
>
> > Is it worth setting up a default config that pulls from the main indexing output?
>
> Yes, I think that makes sense. We want the Batch Profiler to point to the right HDFS URL, no matter where/how Metron is deployed. When Metron gets spun-up on a cluster, I should be able to just run the Batch Profiler without having to fuss with the input path.
>
> On Thu, Sep 20, 2018 at 9:46 AM Justin Leet wrote:
>
> > Re:
> >
> > > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
> >
> > Is it worth setting up a default config that pulls from the main indexing output? I'm a little on the fence about it, but it seems like making the most common case more or less built-in would be nice.
> >
> > Having said that, I do not consider that a requirement for merging the feature branch.
> >
> > On Wed, Sep 19, 2018 at 11:23 AM James Sirota wrote:
> >
> > > I think what you have outlined above is a good initial stab at the feature. Manual install of spark is not a big deal. Configuring via command line while we mature this feature is ok as well. Doesn't look like configuration steps are too hard. I think you should merge.
> > >
> > > James
> > >
> > > 19.09.2018, 08:15, "Nick Allen" :
> > > > I would like to open a discussion to get the Batch Profiler feature branch merged into master as part of METRON-1699 [1] Create Batch Profiler. All of the work that I had in mind for our first draft of the Batch Profiler has been completed. Please take a look through what I have and let me know if there are other features that you think are required *before* we merge.
> > > >
> > > > Previous list discussions on this topic include [2] and [3].
> > > >
> > > > (Q) What can I do with the feature branch?
> > > >
> > > >   * With the Batch Profiler, you can backfill/seed profiles using archived telemetry. This enables the following types of use cases.
> > > >
> > > >       1. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have created a feature set that has predictive value for model building.
> > > >
> > > >       2. As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile that I have created so that I can determine if I have defined the profile correctly and created a feature set that matches reality.
> > > >
> > > >       3. As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
> > > >
> > > >   * METRON-1699 [1] includes a more detailed description of the feature.
> > > >
> > > > (Q) What work was completed?
> > > >
> > > >   * The Batch Profiler runs on Spark and was implemented in Java to remain consistent with our current Java-heavy code base.
> > > >
> > > >   * The Batch Profiler is executed from the command-line. It can be launched using a script or by calling `spark-submit`, which may be useful for advanced users.
> > > >
> > > >   * Input telemetry can be consumed from multiple sources; for example HDFS or the local file system.
> > > >
> > > >   * Input telemetry can be consumed in multiple formats; for example JSON or ORC.
> > > >
> > > >   * The 'output' profile measurements are persisted in HBase and is consistent with the Storm Profiler.
> > > >
> > > >   * It can be run on any underlying engine supported by Spark. I have tested it both in 'local' mode and on a YARN cluster.
> > > >
> > > >   * It is installed automatically by the Metron MPack.
> > > >
> > > >   * A README was added that documents usage instructions.
> > > >
> > > >   * The existing Profiler code was refactored so that as much code as possible is shared between the 3 Profiler ports; Storm, the Stellar REPL, and Spark. For example, the logic which determines the timestamp of a message was refactored so that it could be reused by all ports.
> > > >
> > > >       * metron-profiler-common: The common Profiler code
Re: [DISCUSS] Batch Profiler Feature Branch
I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.

- In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
- In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.

> So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?

You would just get a profile that is slightly different over the entire time span. This is not a new risk. If the user changes their Profile definitions in Zk, the same thing would happen.

On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there? This feels functionally very similar to our flat file loaders, which all have inputs for config from the CLI only. On the other hand, our flat file loaders are not typically seeding an existing structure. My concern of a local file profiler config stems from this stated goal:
>
> > The goal would be to enable “profile seeding” which allows profiles to be populated from a time before the profile was created.
>
> So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
>
> On Thu, Sep 20, 2018 at 10:06 AM Justin Leet wrote:
>
> > The profile not being able to read from ZK feels like a fairly substantial, if subtle, set of potential problems. I'd like to see that in either before merging or at least pretty soon after merging. Is it a lot of work to add that functionality based on where things are right now?
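Editor's sketch of the divergence question raised in this message (what happens if the file-based profile definition does not match the one held in ZK): one way to make the risk visible before a seeding run is to compare the two configurations by profile name. This is only an illustration in Python and is not part of the feature branch. It assumes both configurations are already available as JSON strings with a top-level "profiles" list whose entries carry a "profile" name; the helper name `diverging_profiles` and the sample profiles are made up, and reading the copy out of Zookeeper is left out entirely.

```python
import json


def diverging_profiles(file_config: str, zk_config: str) -> dict:
    """Compare two Profiler configs (JSON strings) by profile name.

    Hypothetical helper: the "profiles"/"profile" layout is an assumption
    about the config format, not something confirmed in this thread.
    """
    def by_name(raw: str) -> dict:
        # Index each profile definition by its name for easy comparison.
        return {p["profile"]: p for p in json.loads(raw).get("profiles", [])}

    local, remote = by_name(file_config), by_name(zk_config)
    shared = local.keys() & remote.keys()
    return {
        "only_in_file": sorted(local.keys() - remote.keys()),
        "only_in_zk": sorted(remote.keys() - local.keys()),
        "changed": sorted(name for name in shared if local[name] != remote[name]),
    }


# Made-up example: one profile differs, one exists only in the local file.
file_cfg = json.dumps({"profiles": [
    {"profile": "hello-world", "foreach": "ip_src_addr",
     "update": {"count": "count + 1"}, "result": "count"},
    {"profile": "speculative", "foreach": "ip_dst_addr", "result": "1"},
]})
zk_cfg = json.dumps({"profiles": [
    {"profile": "hello-world", "foreach": "ip_src_addr",
     "update": {"count": "count + 2"}, "result": "count"},
]})

report = diverging_profiles(file_cfg, zk_cfg)
print(report)
```

A speculative profile that exists only in the file (use cases 1 and 2) shows up under `only_in_file`, while a production profile whose definition has drifted (the use case 3 concern) shows up under `changed`.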
Re: [DISCUSS] Batch Profiler Feature Branch
> How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?

Good point. Right now, if the Streaming and Batch Profiler overlap the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)

So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.

On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
>
>    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
>       1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
>       2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
>       3. Profiles are now up to date from t0 - tCurrent
>    2. I've already done #1 above. Time goes by and now I want to add a new profile.
>       1. Same first step above
>       2. I would run the batch loader with *only* that new profile definition to seed?
>
> Forgive me if I missed this in PR's and discussion in the FB, but how do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
>
> On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote:
>
> > I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
> >
> >    - In use case 1 and 2, this would definitely be the case.
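Editor's sketch of the --begin and --end idea from this message: the effect of such arguments would be to restrict the archived telemetry to a half-open time window before it is profiled. The Python below is a rough illustration only. The message states that these arguments do not exist in the feature branch, so nothing here reflects real Batch Profiler behavior; the function names are made up, and the sketch assumes each message carries an epoch-millisecond "timestamp" field.

```python
from datetime import datetime, timezone


def to_millis(iso: str) -> int:
    """Parse a timestamp such as '2018-09-20T00:00:00' as UTC epoch millis."""
    parsed = datetime.fromisoformat(iso).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


def within_window(messages, begin=None, end=None, field="timestamp"):
    """Keep messages with begin <= timestamp < end.

    A bound left as None is treated as open-ended, which mirrors running
    with only one of the two hypothetical arguments.
    """
    kept = []
    for message in messages:
        ts = message[field]
        if begin is not None and ts < begin:
            continue
        if end is not None and ts >= end:
            continue
        kept.append(message)
    return kept


# Made-up telemetry: two messages before the profile went live, one after.
telemetry = [
    {"ip_src_addr": "10.0.0.1", "timestamp": to_millis("2018-03-15T12:00:00")},
    {"ip_src_addr": "10.0.0.1", "timestamp": to_millis("2018-09-19T23:59:59")},
    {"ip_src_addr": "10.0.0.2", "timestamp": to_millis("2018-09-20T00:00:00")},
]

# Seed only up to "tm", the moment the streaming profile was installed.
tm = to_millis("2018-09-20T00:00:00")
seeded = within_window(telemetry, end=tm)
print(len(seeded))
```

Making the end bound exclusive means a batch window ending at tm and a streaming profile starting at tm do not both claim the message that arrives exactly at tm.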
Re: [DISCUSS] Batch Profiler Feature Branch
> It's just cleaner from a usage/management perspective to say "I want to put a profile in prod, just use streaming profiler and the batch profiler with the same setup and they're good to go."

Agreed. I can add it. It would be a simple addition.

On Thu, Sep 20, 2018 at 12:49 PM Justin Leet wrote:

> I think the main difference between this and the flatfile loader is that we actively maintain our profiles in ZK for streaming. Doing this from files is likely going to be the main usage, particularly for speculative usage.
>
> For me, the main use case for ZK is definitely use case 3.
>
> I can definitely be persuaded that this isn't a blocker for right now, but I think there will be problems in practice from not having the functionality. E.g. "We want to refresh everything because of mistake X, and nobody refreshed the file/ZK and they've diverged". While nobody likes to refresh prod data (or some subset), I have seen it happen in literally every single project I've worked on. On dev/integration environments this is even more likely. Most people probably aren't going to store these files in their version control (even though they probably should) and these sort of divergences will happen.
>
> It's just cleaner from a usage/management perspective to say "I want to put a profile in prod, just use streaming profiler and the batch profiler with the same setup and they're good to go."
>
> On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>
> > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> >
> >    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> >       1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
> >       2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
> >       3. Profiles are now up to date from t0 - tCurrent
> >    2. I've already done #1 above. Time goes by and now I want to add a new profile.
> >       1. Same first step above
> >       2. I would run the batch loader with *only* that new profile definition to seed?
> >
> > Forgive me if I missed this in PR's and discussion in the FB, but how do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> >
> > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote:
> >
> > > I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
> > >
> > >    - In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
> > >    - In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.
> > >
> > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> > >
> > > You would just get a profile that is slightly different over the entire time span. This is not a new risk. If the user changes their Profile definitions in Zk, the same thing would happen.
> > >
> > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> > >
> > > > I think I'm torn on this, specifically because it's batch and would generally be run as-needed. Justin, can you elaborate on your concerns there?
Re: [DISCUSS] Batch Profiler Feature Branch
Assuming you have 9 months of data archived, yes.

On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?
>
> On Thu, Sep 20, 2018 at 11:13 AM Nick Allen wrote:
>
> > > How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> >
> > Good point. Right now, if the Streaming and Batch Profiler overlap the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)
> >
> > So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.
> >
> > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >
> > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> > >
> > >    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> > >       1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
> > >       2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
> > >       3. Profiles are now up to date from t0 - tCurrent
> > >    2. I've already done #1 above. Time goes by and now I want to add a new profile.
> > >       1. Same first step above
> > >       2. I would run the batch loader with *only* that new profile definition to seed?
Re: [DISCUSS] Batch Profiler Feature Branch
Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...

   1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
      1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
      2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
      3. Profiles are now up to date from t0 - tCurrent.
   2. I've already done #1 above. Time goes by and now I want to add a new profile.
      1. Same first step above.
      2. I would run the batch loader with *only* that new profile definition to seed?

Forgive me if I missed this in PRs and discussion in the FB, but how do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?

On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote:

> I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
>
>    - In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
>    - In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.
>
> > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
>
> You would just get a profile that is slightly different over the entire time span. This is not a new risk. If the user changes their Profile definitions in Zk, the same thing would happen.
> > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic < > michael.miklav...@gmail.com> wrote: > > > I think I'm torn on this, specifically because it's batch and would > > generally be run as-needed. Justin, can you elaborate on your concerns > > there? This feels functionally very similar to our flat file loaders, > which > > all have inputs for config from the CLI only. On the other hand, our flat > > file loaders are not typically seeding an existing structure. My concern > of > > a local file profiler config stems from this stated goal: > > > The goal would be to enable “profile seeding” which allows profiles to > be > > populated from a time before the profile was created. > > So if the config does not correctly match the profiler config held in ZK > > and the user runs the batch seeding job, what happens? > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet > > wrote: > > > > > The profile not being able to read from ZK feels like a fairly > > substantial, > > > if subtle, set of potential problems. I'd like to see that in either > > > before merging or at least pretty soon after merging. Is it a lot of > > work > > > to add that functionality based on where things are right now? > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen wrote: > > > > > > > Here is another limitation that I just thought. It can only read a > > > profile > > > > definition from a file. It probably also makes sense to add an > option > > > that > > > > allows it to read the current Profiler configuration from Zookeeper. > > > > > > > > > > > > > Is it worth setting up a default config that pulls from the main > > > indexing > > > > output? > > > > > > > > Yes, I think that makes sense. We want the Batch Profiler to point > to > > > the > > > > right HDFS URL, no matter where/how Metron is deployed. When Metron > > gets > > > > spun-up on a cluster, I should be able to just run the Batch Profiler > > > > without having to fuss with the input path. 
> > > > > > > > > > > > > > > > > > > > > > > > On Thu, Sep 20, 2018 at 9:46 AM Justin Leet > > > wrote: > > > > > > > > > Re: > > > > > > > > > > > * You do not configure the Batch Profiler in Ambari. It is > > > configured > > > > > > and executed completely from the command-line. > > > > > > > > > > > > > > > > Is it worth setting up a default config that pulls from the main > > > indexing > > > > > output? I'm a little on the fence about it, but it seems like > making > > > the > > > > > most common case more or less built-in would be nice. > > > > > > > > > > Having said that, I do not consider that a requirement for merging > > the > > > > > feature branch. > > > > > > > > > > On Wed, Sep 19, 2018 at 11:23 AM James Sirota > > > > wrote: > > > > > > > > > > > I think what you have outlined above is a good initial stab at > the > > > > > > feature. Manual install of spark is not a big deal. Configuring > > via > > > > > > command line while we mature this feature is ok as well. Doesn't > > > look > > > > > like > > > > > > configuration steps are too hard. I
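The window Mike describes in 1.1 can be made concrete with a small Python sketch. Nothing here is Metron code: `floor_to_period`, `seeding_window`, and the 15-minute period are hypothetical names and values, and aligning both ends down to a period boundary is only one possible answer to the overlap-or-gap question. Under that assumption the batch job covers whole periods up to the one in which the profile was installed, and the streaming Profiler owns that period and everything after it.

```python
from datetime import datetime, timedelta, timezone

# Assumed profile period; the real value would come from the Profiler config.
PERIOD = timedelta(minutes=15)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def floor_to_period(ts, period=PERIOD):
    """Round a timezone-aware timestamp down to the start of its period."""
    whole_periods = (ts - EPOCH) // period
    return EPOCH + whole_periods * period


def seeding_window(t0, tm, period=PERIOD):
    """Return the half-open window [begin, end) for a hypothetical batch run.

    t0 is the timestamp of the oldest archived telemetry and tm is the
    time the profile was first installed for streaming.
    """
    begin = floor_to_period(t0, period)
    end = floor_to_period(tm, period)  # exclusive: streaming owns this period
    return begin, end
```

The period that contains tm is the one to watch: streaming only saw part of it, so a batch run that stops at its start leaves that single period partially profiled, while a run that includes it overwrites the streaming value.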
Re: [DISCUSS] Batch Profiler Feature Branch
I think the main difference between this and the flatfile loader is that we actively maintain our profiles in ZK for streaming. Doing this from files is likely going to be the main usage, particularly for speculative usage. For me, the main use case for ZK is definitely use case 3.

I can definitely be persuaded that this isn't a blocker for right now, but I think there will be problems in practice from not having the functionality. E.g., "We want to refresh everything because of mistake X, and nobody refreshed the file/ZK and they've diverged". While nobody likes to refresh prod data (or some subset), I have seen it happen in literally every single project I've worked on. On dev/integration environments this is even more likely. Most people probably aren't going to store these files in their version control (even though they probably should) and these sorts of divergences will happen.

It's just cleaner from a usage/management perspective to say "I want to put a profile in prod, just use streaming profiler and the batch profiler with the same setup and they're good to go."

On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
>
>    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
>       1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
>       2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
>       3. Profiles are now up to date from t0 - tCurrent
>    2. I've already done #1 above. Time goes by and now I want to add a new profile.
>       1. Same first step above
>       2. I would run the batch loader with *only* that new profile definition to seed?
> > Forgive me if I missed this in PR's and discussion in the FB, but how do we > establish "tm" from 1.1 above? Any concerns about overlap or gaps after the > seeding is performed? > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote: > > > I think more often than not, you would want to load your profile > definition > > from a file. This is why I considered the 'load from Zk' more of a > > nice-to-have. > > > >- In use case 1 and 2, this would definitely be the case. The > profiles > >I am working with are speculative and I am using the batch profiler to > >determine if they are worth keeping. In this case, my speculative > > profiles > >would not be in Zk (yet). > >- In use case 3, I could see it go either way. It might be useful to > >load from Zk, but it certainly isn't a blocker. > > > > > > > So if the config does not correctly match the profiler config held in > ZK > > and > > the user runs the batch seeding job, what happens? > > > > You would just get a profile that is slightly different over the entire > > time span. This is not a new risk. If the user changes their Profile > > definitions in Zk, the same thing would happen. > > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic < > > michael.miklav...@gmail.com> wrote: > > > > > I think I'm torn on this, specifically because it's batch and would > > > generally be run as-needed. Justin, can you elaborate on your concerns > > > there? This feels functionally very similar to our flat file loaders, > > which > > > all have inputs for config from the CLI only. On the other hand, our > flat > > > file loaders are not typically seeding an existing structure. My > concern > > of > > > a local file profiler config stems from this stated goal: > > > > The goal would be to enable “profile seeding” which allows profiles > to > > be > > > populated from a time before the profile was created. 
> > > So if the config does not correctly match the profiler config held in > ZK > > > and the user runs the batch seeding job, what happens? > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet > > > wrote: > > > > > > > The profile not being able to read from ZK feels like a fairly > > > substantial, > > > > if subtle, set of potential problems. I'd like to see that in either > > > > before merging or at least pretty soon after merging. Is it a lot of > > > work > > > > to add that functionality based on where things are right now? > > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen > wrote: > > > > > > > > > Here is another limitation that I just thought. It can only read a > > > > profile > > > > > definition from a file. It probably also makes sense to add an > > option > > > > that > > > > > allows it to read the current Profiler configuration from > Zookeeper. > > > > > > > > > > > > > > > > Is it worth setting up a default config that pulls from the main > > > > indexing > > > > > output? > > > > > > > > > > Yes, I think
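The divergence Justin describes is cheap to detect if both definitions are available as JSON. A minimal Python sketch, assuming the profile definition is a JSON document with a top-level `profiles` array whose entries carry a `profile` name; `diverged_profiles` is a hypothetical helper, and fetching the ZooKeeper copy is left out entirely.

```python
import json


def diverged_profiles(file_json, zk_json):
    """Return the sorted names of profiles that differ between two definitions.

    A profile counts as diverged when it exists in only one source or when
    any part of its definition differs. Both arguments are JSON strings.
    """
    def by_name(text):
        # Index each profile entry by its (assumed) "profile" name field.
        return {p["profile"]: p for p in json.loads(text).get("profiles", [])}

    from_file, from_zk = by_name(file_json), by_name(zk_json)
    names = from_file.keys() | from_zk.keys()
    return sorted(n for n in names if from_file.get(n) != from_zk.get(n))
```

A check like this could run before a seeding job and refuse to start, or at least warn, when the list is not empty.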
Re: [DISCUSS] Batch Profiler Feature Branch
So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months? On Thu, Sep 20, 2018 at 11:13 AM Nick Allen wrote: > > How do we establish "tm" from 1.1 above? Any concerns about overlap or > gaps after the seeding is performed? > > Good point. Right now, if the Streaming and Batch Profiler overlap the > last write wins. And presumably the output of the Streaming and Batch > Profiler are the same, so no worries, right? :) > > So it kind of works, but it is definitely not ideal for use case 3. I > could add --begin and --end args to constrain the time frame over which the > Batch Profiler runs. I do not have that in the feature branch. It would > be easy enough to add though. > > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic < > michael.miklav...@gmail.com> wrote: > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. > Pulling > > at this thread just a bit more... > > > >1. I have an existing system that's been up a while, and I have added > k > >profiles - assume these are the first profiles I've created. > > 1. I would have t0 - tm (where m is the time when the profiles were > > first installed) worth of data that has not been profiled yet. > > 2. The batch profiler process would be to take that exact profile > > definition from ZK and run the batch loader with that from the CLI. > > 3. Profiles are now up to date from t0 - tCurrent > >2. I've already done #1 above. Time goes by and now I want to add a > new > >profile. > > 1. Same first step above > > 2. I would run the batch loader with *only* that new profile > > definition to seed? > > > > Forgive me if I missed this in PR's and discussion in the FB, but how do > we > > establish "tm" from 1.1 above? Any concerns about overlap or gaps after > the > > seeding is performed? 
> > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote: > > > > > I think more often than not, you would want to load your profile > > definition > > > from a file. This is why I considered the 'load from Zk' more of a > > > nice-to-have. > > > > > >- In use case 1 and 2, this would definitely be the case. The > > profiles > > >I am working with are speculative and I am using the batch profiler > to > > >determine if they are worth keeping. In this case, my speculative > > > profiles > > >would not be in Zk (yet). > > >- In use case 3, I could see it go either way. It might be useful > to > > >load from Zk, but it certainly isn't a blocker. > > > > > > > > > > So if the config does not correctly match the profiler config held in > > ZK > > > and > > > the user runs the batch seeding job, what happens? > > > > > > You would just get a profile that is slightly different over the entire > > > time span. This is not a new risk. If the user changes their Profile > > > definitions in Zk, the same thing would happen. > > > > > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic < > > > michael.miklav...@gmail.com> wrote: > > > > > > > I think I'm torn on this, specifically because it's batch and would > > > > generally be run as-needed. Justin, can you elaborate on your > concerns > > > > there? This feels functionally very similar to our flat file loaders, > > > which > > > > all have inputs for config from the CLI only. On the other hand, our > > flat > > > > file loaders are not typically seeding an existing structure. My > > concern > > > of > > > > a local file profiler config stems from this stated goal: > > > > > The goal would be to enable “profile seeding” which allows profiles > > to > > > be > > > > populated from a time before the profile was created. > > > > So if the config does not correctly match the profiler config held in > > ZK > > > > and the user runs the batch seeding job, what happens? 
> > > > > > > > On Thu, Sep 20, 2018 at 10:06 AM Justin Leet > > > > wrote: > > > > > > > > > The profile not being able to read from ZK feels like a fairly > > > > substantial, > > > > > if subtle, set of potential problems. I'd like to see that in > either > > > > > before merging or at least pretty soon after merging. Is it a lot > of > > > > work > > > > > to add that functionality based on where things are right now? > > > > > > > > > > On Thu, Sep 20, 2018 at 9:59 AM Nick Allen > > wrote: > > > > > > > > > > > Here is another limitation that I just thought. It can only read > a > > > > > profile > > > > > > definition from a file. It probably also makes sense to add an > > > option > > > > > that > > > > > > allows it to read the current Profiler configuration from > > Zookeeper. > > > > > > > > > > > > > > > > > > > Is it worth setting up a default config that pulls from the > main > > > > > indexing > > > > > > output? > > > > > > > > > > > > Yes, I think that makes sense. We want the Batch Profiler to >
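Nick's "last write wins" point can be illustrated with a toy key-value store in Python. This is not how the Profiler talks to HBase; the `(profile, entity, period)` key, the `write` helper, and the period numbers are assumptions chosen only to show what happens when a batch run and the streaming Profiler cover the same periods.

```python
def write(store, profile, entity, period, value, source):
    """Store one profile measurement; a later write to the same key replaces it."""
    store[(profile, entity, period)] = (value, source)


def simulate_overlap():
    """Streaming writes periods 10-12, then a batch run rewrites periods 0-11."""
    store = {}
    for period in range(10, 13):
        write(store, "logins", "10.0.0.1", period, 5, "streaming")
    for period in range(0, 12):
        write(store, "logins", "10.0.0.1", period, 5, "batch")
    return store
```

Periods 10 and 11 end up holding the batch values. That is harmless as long as both Profilers compute the same result from the same definition, which is the assumption the rest of the thread questions.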
Re: [DISCUSS] Migrate from Protractor to Cypress
That's good feedback, thanks Shane!

On Thu, Sep 20, 2018 at 6:23 AM Shane Ardell wrote:

> While the Cypress team suggests taking advantage of stubs where you can, especially during development, we would definitely be able to test real endpoints [1]. It can be used exactly like how Protractor is used, but with the many benefits and features it provides [2]. Cypress also offers tools for unit testing [3], which I think may be causing confusion as to what exactly the library does. Cypress' main focus is e2e tests, but because of its architecture, it can be used for all types of tests.
>
> I agree with everything you mentioned, Mike. I think our approach now is fine, but in the future I do think it's worth considering the Cypress team's suggestions for when and when not to stub, but there are no hard and fast rules [4][5].
>
> I currently have a branch available on my fork where I've migrated over some e2e tests from Protractor to Cypress. With the exception of a little code cleanup, these tests perform the same steps as they do with Protractor. I have yet to include instructions in the README or include an npm script, but if anyone wants to see it in action they can do the following:
>
>    - download this branch: https://github.com/sardell/metron/tree/METRON-1648,
>    - run `npm ci` from metron-alerts,
>    - start the e2e test server,
>    - run `./node_modules/.bin/cypress open`
>    - start a single test by clicking on a file name in the Cypress user interface, or run them all by clicking the play button.
>
> I'll try to send some sort of benchmarks when I get a chance to show the speed difference between the two libraries.
>
> [1] https://docs.cypress.io/api/commands/request.html
> [2] https://www.cypress.io/features/
> [3] https://docs.cypress.io/guides/guides/stubs-spies-and-clocks.html
> [4] https://docs.cypress.io/guides/guides/network-requests.html#Testing-Strategies
> [5] https://docs.cypress.io/guides/getting-started/testing-your-app.html#Stubbing-the-Server
>
> On Thu, Sep 20, 2018 at 12:09 AM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>
> > Shane,
> >
> > Can you elaborate on the testing model you're proposing? I looked through the overview and some of the documentation. As far as I can tell, this would effectively be an e2e test for the UI *only*, so we would be missing testing the actual integration points with the REST API or any other potential endpoints.
> >
> >    1. Are you proposing we migrate all existing e2e tests, including those that currently hit Elasticsearch?
> >    2. Would shifting to Cypress mean that all e2e tests would be isolated to only what is rendered via the browser? i.e. our e2e suite is no longer testing integration to a backend?
> >
> > My assumption with the term e2e testing is that you are testing an entire vertical slice with no substantive mock/stub/fake/spy/dummy [1] in the way except for maybe some strategic cross-cutting concerns. It sounds like Cypress does NOT mean full e2e. My initial reaction to this is that there's a place for both forms of testing. If Cypress would help UI developers work on incremental changes, similar to how unit tests via JUnit help Java developers iterate on features, then I think that's great. I'm all for that! But unit tests are only one form of testing - we also do integration testing, which can flex multiple classes/components together, as well as more broad stack integration/functional testing that verifies everything works when integrated together. Generally speaking, total # of unit tests > # of integration tests > # functional/acceptance tests. I think we should carve out and define a testing approach for each.
Can you elaborate a bit > > on your vision for how to manage the test interactions, or lack thereof, > > with the REST API as an integration endpoint? [2] > > > > At the time the write-up James shared was written, it appears that > Cypress > > was not yet open source. Now, it's MIT license - > > https://github.com/cypress-io/cypress/blob/develop/LICENSE.md. > > > > Mike > > > > 1. > > > > > https://martinfowler.com/articles/mocksArentStubs.html#TheDifferenceBetweenMocksAndStubs > > 2. https://martinfowler.com/articles/practical-test-pyramid.html#UiTests > > > > > > On Wed, Sep 19, 2018 at 8:47 AM James Sirota wrote: > > > > > This article comparing the two is not favorable for Cypress. Are any > of > > > these concerns relevant to us? If not, then I think Cypress is fine > > > > > > > > > > > > https://hackernoon.com/cypress-io-vs-protractor-e2e-testing-battle-d124ece91dc7 > > > > > > > > >
Re: [DISCUSS] Batch Profiler Feature Branch
Here is another limitation that I just thought of. It can only read a profile definition from a file. It probably also makes sense to add an option that allows it to read the current Profiler configuration from Zookeeper.

> Is it worth setting up a default config that pulls from the main indexing output?

Yes, I think that makes sense. We want the Batch Profiler to point to the right HDFS URL, no matter where/how Metron is deployed. When Metron gets spun up on a cluster, I should be able to just run the Batch Profiler without having to fuss with the input path.

On Thu, Sep 20, 2018 at 9:46 AM Justin Leet wrote:

> Re:
>
> > * You do not configure the Batch Profiler in Ambari. It is configured and executed completely from the command-line.
>
> Is it worth setting up a default config that pulls from the main indexing output? I'm a little on the fence about it, but it seems like making the most common case more or less built-in would be nice.
>
> Having said that, I do not consider that a requirement for merging the feature branch.
>
> On Wed, Sep 19, 2018 at 11:23 AM James Sirota wrote:
>
> > I think what you have outlined above is a good initial stab at the feature. Manual install of Spark is not a big deal. Configuring via command line while we mature this feature is ok as well. Doesn't look like configuration steps are too hard. I think you should merge.
> >
> > James
> >
> > 19.09.2018, 08:15, "Nick Allen":
> > > I would like to open a discussion to get the Batch Profiler feature branch merged into master as part of METRON-1699 [1] Create Batch Profiler. All of the work that I had in mind for our first draft of the Batch Profiler has been completed. Please take a look through what I have and let me know if there are other features that you think are required *before* we merge.
> > >
> > > Previous list discussions on this topic include [2] and [3].
> > > > > > (Q) What can I do with the feature branch? > > > > > > * With the Batch Profiler, you can backfill/seed profiles using > > archived > > > telemetry. This enables the following types of use cases. > > > > > > 1. As a Security Data Scientist, I want to understand the > > historical > > > behaviors and trends of a profile that I have created so that I can > > > determine if I have created a feature set that has predictive value for > > > model building. > > > > > > 2. As a Security Data Scientist, I want to understand the > > historical > > > behaviors and trends of a profile that I have created so that I can > > > determine if I have defined the profile correctly and created a feature > > set > > > that matches reality. > > > > > > 3. As a Security Platform Engineer, I want to generate a profile > > > using archived telemetry when I deploy a new model to production so > that > > > models depending on that profile can function on day 1. > > > > > > * METRON-1699 [1] includes a more detailed description of the > feature. > > > > > > (Q) What work was completed? > > > > > > * The Batch Profiler runs on Spark and was implemented in Java to > > remain > > > consistent with our current Java-heavy code base. > > > > > > * The Batch Profiler is executed from the command-line. It can be > > > launched using a script or by calling `spark-submit`, which may be > useful > > > for advanced users. > > > > > > * Input telemetry can be consumed from multiple sources; for example > > HDFS > > > or the local file system. > > > > > > * Input telemetry can be consumed in multiple formats; for example > JSON > > > or ORC. > > > > > > * The 'output' profile measurements are persisted in HBase and is > > > consistent with the Storm Profiler. > > > > > > * It can be run on any underlying engine supported by Spark. I have > > > tested it both in 'local' mode and on a YARN cluster. > > > > > > * It is installed automatically by the Metron MPack. 
> > > > > > * A README was added that documents usage instructions. > > > > > > * The existing Profiler code was refactored so that as much code as > > > possible is shared between the 3 Profiler ports; Storm, the Stellar > REPL, > > > and Spark. For example, the logic which determines the timestamp of a > > > message was refactored so that it could be reused by all ports. > > > > > > * metron-profiler-common: The common Profiler code shared amongst > > > each port. > > > * metron-profiler-storm: Profiler on Storm > > > * metron-profiler-spark: Profiler on Spark > > > * metron-profiler-repl: Profiler on the Stellar REPL > > > * metron-profiler-client: The client code for retrieving profile > > > data; for example PROFILE_GET. > > > > > > * There are 3 separate RPM and DEB packages now created for the > > Profiler. > > > > > > * metron-profiler-storm-*.rpm > > > * metron-profiler-spark-*.rpm > > > * metron-profiler-repl-*.rpm > > > > > > * The Profiler
Re: [DISCUSS] Batch Profiler Feature Branch
I think we might want to allow the flexibility to choose the date range then. I don't yet feel like I have a good enough understanding of all the ways in which users would want to seed to force them to run the batch job over all the data. It might also make it easier to deal with remediation, i.e. an error doesn't force you to re-run over the entire history. Same goes for testing out the profile seeding batch job in the first place.

On Thu, Sep 20, 2018 at 11:23 AM Nick Allen wrote:

> Assuming you have 9 months of data archived, yes.
>
> On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
>
> > So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?
> >
> > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen wrote:
> >
> > > > How do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> > >
> > > Good point. Right now, if the Streaming and Batch Profiler overlap, the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)
> > >
> > > So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.
> > >
> > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> > >
> > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> > > >
> > > >    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> > > >       1.
I would have t0 - tm (where m is the time when the profiles > > were > > > > first installed) worth of data that has not been profiled yet. > > > > 2. The batch profiler process would be to take that exact > profile > > > > definition from ZK and run the batch loader with that from the > > CLI. > > > > 3. Profiles are now up to date from t0 - tCurrent > > > >2. I've already done #1 above. Time goes by and now I want to add > a > > > new > > > >profile. > > > > 1. Same first step above > > > > 2. I would run the batch loader with *only* that new profile > > > > definition to seed? > > > > > > > > Forgive me if I missed this in PR's and discussion in the FB, but how > > do > > > we > > > > establish "tm" from 1.1 above? Any concerns about overlap or gaps > after > > > the > > > > seeding is performed? > > > > > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen > > wrote: > > > > > > > > > I think more often than not, you would want to load your profile > > > > definition > > > > > from a file. This is why I considered the 'load from Zk' more of a > > > > > nice-to-have. > > > > > > > > > >- In use case 1 and 2, this would definitely be the case. The > > > > profiles > > > > >I am working with are speculative and I am using the batch > > profiler > > > to > > > > >determine if they are worth keeping. In this case, my > speculative > > > > > profiles > > > > >would not be in Zk (yet). > > > > >- In use case 3, I could see it go either way. It might be > useful > > > to > > > > >load from Zk, but it certainly isn't a blocker. > > > > > > > > > > > > > > > > So if the config does not correctly match the profiler config > held > > in > > > > ZK > > > > > and > > > > > the user runs the batch seeding job, what happens? > > > > > > > > > > You would just get a profile that is slightly different over the > > entire > > > > > time span. This is not a new risk. If the user changes their > > Profile > > > > > definitions in Zk, the same thing would happen. 
> > > > > > > > > > > > > > > On Thu, Sep 20, 2018 at 12:15 PM Michael Miklavcic < > > > > > michael.miklav...@gmail.com> wrote: > > > > > > > > > > > I think I'm torn on this, specifically because it's batch and > would > > > > > > generally be run as-needed. Justin, can you elaborate on your > > > concerns > > > > > > there? This feels functionally very similar to our flat file > > loaders, > > > > > which > > > > > > all have inputs for config from the CLI only. On the other hand, > > our > > > > flat > > > > > > file loaders are not typically seeding an existing structure. My > > > > concern > > > > > of > > > > > > a local file profiler config stems from this stated goal: > > > > > > > The goal would be to enable “profile seeding” which allows > > profiles > > > > to > > > > > be > > > > > > populated from a time before the profile was created. > > > > > > So if the config does not correctly match the profiler config > held > > in > > > >
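The date-range idea can be sketched in Python as follows. The `--begin` and `--end` names come from Nick's suggestion, but he notes they are not in the feature branch, so everything below is hypothetical: the argument parsing, the `in_window` filter, the half-open interval, and the assumption that each telemetry message carries a `timestamp` field in epoch milliseconds.

```python
import argparse
from datetime import datetime, timezone


def parse_utc(text):
    """Parse an ISO-8601 string; values without an offset are treated as UTC."""
    ts = datetime.fromisoformat(text)
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def parse_args(argv):
    """Parse the hypothetical time-constraint options for a batch run."""
    parser = argparse.ArgumentParser(description="time-bounded batch profiling (sketch)")
    parser.add_argument("--begin", type=parse_utc, default=None,
                        help="only profile telemetry at or after this time")
    parser.add_argument("--end", type=parse_utc, default=None,
                        help="only profile telemetry before this time")
    return parser.parse_args(argv)


def in_window(message, begin, end):
    """Return True if the message timestamp falls in [begin, end).

    A bound of None leaves that side of the window open, which matches
    the current behavior of running over all archived data.
    """
    ts = datetime.fromtimestamp(message["timestamp"] / 1000, tz=timezone.utc)
    return (begin is None or ts >= begin) and (end is None or ts < end)
```

With both options omitted every message passes the filter, so a re-run after an error can be limited to the affected range without changing the default behavior.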
Re: [DISCUSS] Batch Profiler Feature Branch
Yeah, agreed. Per use case 3, when deploying to production there really wouldn't be a huge overlap like 3 months of already profiled data. It's day 1 and the profile was just deployed around the same time as you are running the Batch Profiler, so the overlap is in minutes, maybe hours. But I can definitely see the usefulness of the feature for re-runs, etc., as you have described.

Based on this discussion, I created a few JIRAs. Thanks all for the great feedback and keep it coming.

[1] METRON-1787 - Input Time Constraints for Batch Profiler
[2] METRON-1788 - Fetch Profile Definitions from Zk for Batch Profiler
[3] METRON-1789 - MPack Should Define Default Input Path for Batch Profiler

--
[1] https://issues.apache.org/jira/browse/METRON-1787
[2] https://issues.apache.org/jira/browse/METRON-1788
[3] https://issues.apache.org/jira/browse/METRON-1789

On Thu, Sep 20, 2018 at 1:34 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:

> I think we might want to allow the flexibility to choose the date range then. I don't yet feel like I have a good enough understanding of all the ways in which users would want to seed to force them to run the batch job over all the data. It might also make it easier to deal with remediation, i.e. an error doesn't force you to re-run over the entire history. Same goes for testing out the profile seeding batch job in the first place.
>
> On Thu, Sep 20, 2018 at 11:23 AM Nick Allen wrote:
>
> > Assuming you have 9 months of data archived, yes.
> >
> > On Thu, Sep 20, 2018 at 1:22 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> >
> > > So in the case of 3 - if you had 6 months of data that hadn't been profiled and another 3 that had been profiled (9 months total data), in its current form the batch job runs over all 9 months?
> > >
> > > On Thu, Sep 20, 2018 at 11:13 AM Nick Allen wrote:
> > >
> > > > > How do we establish "tm" from 1.1 above?
> > > > > Any concerns about overlap or gaps after the seeding is performed?
> > > >
> > > > Good point. Right now, if the Streaming and Batch Profiler overlap, the last write wins. And presumably the output of the Streaming and Batch Profiler are the same, so no worries, right? :)
> > > >
> > > > So it kind of works, but it is definitely not ideal for use case 3. I could add --begin and --end args to constrain the time frame over which the Batch Profiler runs. I do not have that in the feature branch. It would be easy enough to add though.
> > > >
> > > > On Thu, Sep 20, 2018 at 12:41 PM Michael Miklavcic <michael.miklav...@gmail.com> wrote:
> > > >
> > > > > Ok, makes sense. That's sort of what I was thinking as well, Nick. Pulling at this thread just a bit more...
> > > > >
> > > > >    1. I have an existing system that's been up a while, and I have added k profiles - assume these are the first profiles I've created.
> > > > >       1. I would have t0 - tm (where m is the time when the profiles were first installed) worth of data that has not been profiled yet.
> > > > >       2. The batch profiler process would be to take that exact profile definition from ZK and run the batch loader with that from the CLI.
> > > > >       3. Profiles are now up to date from t0 - tCurrent
> > > > >    2. I've already done #1 above. Time goes by and now I want to add a new profile.
> > > > >       1. Same first step above
> > > > >       2. I would run the batch loader with *only* that new profile definition to seed?
> > > > >
> > > > > Forgive me if I missed this in PRs and discussion in the FB, but how do we establish "tm" from 1.1 above? Any concerns about overlap or gaps after the seeding is performed?
> > > > >
> > > > > On Thu, Sep 20, 2018 at 10:26 AM Nick Allen wrote:
> > > > >
> > > > > > I think more often than not, you would want to load your profile definition from a file. This is why I considered the 'load from Zk' more of a nice-to-have.
> > > > > >
> > > > > >    - In use case 1 and 2, this would definitely be the case. The profiles I am working with are speculative and I am using the batch profiler to determine if they are worth keeping. In this case, my speculative profiles would not be in Zk (yet).
> > > > > >    - In use case 3, I could see it go either way. It might be useful to load from Zk, but it certainly isn't a blocker.
> > > > > >
> > > > > > > So if the config does not correctly match the profiler config held in ZK and the user runs the batch seeding job, what happens?
> > > > > > > >