Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Michael Miklavcic
Any volunteers for creating a set of jiras and feature branch for an object store repository? This sounds like a massive feature. On Jan 5, 2018 2:06 PM, "Otto Fowler" wrote: > I would say that at the stellar author level, you would just get objects > from the store and

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
Yes, abstracted. We have an api of stellar functions that just load things from the store; they don’t need to bleed through what the store is. We have a ‘store’, which may be hdfs or hbase or whatever. We have an api for adding to the store (add, etc.) that doesn’t presume the store either. Then
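A minimal sketch of the separation described in this message, assuming a hypothetical `ObjectStore` interface with `add` and `get`. None of these names come from Metron; the in-memory backend only stands in for HDFS or HBase so the example runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class ObjectStore(ABC):
    """Hypothetical store interface; callers never learn what backs it."""

    @abstractmethod
    def add(self, name: str, obj: Any) -> None:
        """Put an object into the store under a name."""

    @abstractmethod
    def get(self, name: str) -> Any:
        """Fetch a previously added object by name."""


class InMemoryStore(ObjectStore):
    """Illustrative backend; an HDFS or HBase backend would expose the same two methods."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def add(self, name: str, obj: Any) -> None:
        self._objects[name] = obj

    def get(self, name: str) -> Any:
        # Raises KeyError if nothing was stored under this name.
        return self._objects[name]


def object_get(store: ObjectStore, name: str) -> Any:
    """What a store-agnostic loader function might reduce to (hypothetical)."""
    return store.get(name)
```

Because the loader depends only on the interface, swapping the backend would not change what a function author writes.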

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
I would say that at the stellar author level, you would just get objects from the store and the ‘override’ case would be a follow on for edge cases. On January 5, 2018 at 14:29:16, Casey Stella (ceste...@gmail.com) wrote: Well, you can pull the default configs from global configs, but you might

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Casey Stella
Well, you can pull the default configs from global configs, but you might want to override them (similar to the profiler). For instance, you might want to interact with another hbase table than the one globally configured. On Fri, Jan 5, 2018 at 12:04 PM, Otto Fowler

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Michael Miklavcic
I'm not sure I follow what you're saying as it pertains to summary objects. Repository is a loaded term, and I'm very apprehensive of pushing for something potentially very complex where a simpler solution would suffice in the short term. To wit, the items I'm seeing in this use case doc -

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
I would imagine the ‘stellar-object-repo’ would be part of the global configuration or configuration passed to the command. Why specify it in the function itself? On January 5, 2018 at 11:22:32, Casey Stella (ceste...@gmail.com) wrote: I like that, specifically the repositories abstraction.

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Casey Stella
I like that, specifically the repositories abstraction. Perhaps we can construct some longer term JIRAs for extensions. For the current state of affairs (wrt the OBJECT_GET call) I was imagining the simple default HDFS solution as a first cut and following on adding a repository name (e.g.

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Otto Fowler
If we separate the concerns as I have stated previously: 1. Stellar can load objects into ‘caches’ from some repository and refer to them. 2. The repositories 3. Some number of strategies to populate and possibly update the repository, from spark, to MR jobs to whatever you would classify the

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-05 Thread Justin Leet
I agree with the general sentiment that we can tailor specific use cases via UI, and I'm worried that the use case specific solution (particularly in light of the note that it's not even general to the class of bloom filter problems, let alone an actually general problem) becomes more work than

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Michael Miklavcic
I mentioned this earlier, but I'll reiterate that I think this approach gives us the ability to make specific use cases via a UI, or other interface should we choose to add one, while keeping the core adaptable and flexible. This is ideal for middle tier as I think this effectively gives us the

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Casey Stella
It also occurs to me that even in this situation, it's not a sufficient generalization for just Bloom, but this is a bloom filter of the output of all the typosquatted domains for the domain in each row. If we wanted to hard code, we'd have to hard code specifically the bloom filter *for*

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-04 Thread Casey Stella
So, there is value outside of just bloom usage. The most specific example of this would be in order to configure a bloom filter, we need to know at least an upper bound of the number of items that are going to be added to the bloom filter. In order to do that, we need to count the number of
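To illustrate why the item count matters, here is a sketch using the standard textbook sizing formulas for a bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name and parameters are assumptions for illustration; the bloom filter implementation discussed in this thread may size itself differently.

```python
import math
from typing import Tuple


def bloom_parameters(expected_items: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) from the textbook formulas.

    expected_items is the upper bound on insertions, which is why the data
    has to be counted before the filter can be configured.
    """
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")
    ln2 = math.log(2)
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    bits = math.ceil(-expected_items * math.log(false_positive_rate) / (ln2 ** 2))
    # k = (m / n) * ln 2, with at least one hash function
    hash_count = max(1, round(bits / expected_items * ln2))
    return bits, hash_count
```

Underestimating `expected_items` raises the real false positive rate above the target, so an upper bound is the safer input.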

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread James Sirota
I just went through these pull requests as well and also agree this is good work. I think it's a good first pass. I would be careful with trying to boil the ocean here. I think for the initial use case I would only support loading the bloom filters from HDFS. If people want to pre-process

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I just finished stepping through the typosquatting use case README in your merge branch. This is really, really good work Casey. I see most of our previous documentation issues addressed up front, e.g. special variables are cited, all new fields explained, side effects documented. The use case doc

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Michael Miklavcic
I'm liking this design and growth strategy, Casey. I also think Nick and Otto have some valid points. I always find there's a natural tension between too little, just enough, and boiling the ocean and these discuss threads really help drive what the short and long term visions should look like.

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Simon Elliston Ball
There is some really cool stuff happening here, if only I’d been allowed to see the lists over Christmas... :) A few thoughts... I like Otto’s generalisation of the problem to include specific local stellar objects in a cache loaded from a store (HDFS seems a natural, but not only place,

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
Oh, gotcha. That makes sense. Thanks for clarifying. On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella wrote: > It's actually many more than 1M. There are 1M domains, each domain could > have upwards of 300 - 1000 possible typosquatted domains. > > You will notice from >

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
It's actually many more than 1M. There are 1M domains, each domain could have upwards of 300 - 1000 possible typosquatted domains. You will notice from https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter that we are not adding

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Nick Allen
> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think this will impact performance What exactly takes an hour? Adding 1M entries to a bloom filter? That seems really high, unless I am not understanding something. On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2018-01-03 Thread Casey Stella
Thanks for the feedback, Nick. Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation." I would argue that we are not reinventing the wheel for text manipulation as the extractor config exists already and we are doing a similar thing in the flatfile loader (in fact, the code is

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-26 Thread Nick Allen
First off, I really do like the typosquatting use case and a lot of what you have described. > We need a way to generate the summary sketches from flat data for this to > work. > .. > I took this quote directly from your use case. Above is the point that I'd like to discuss and what your

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Otto Fowler
1st. You are not the only one looking at the list on Dec 24th Casey, so don’t feel bad. 2nd. Maybe we can separate this into 2 areas of concern. 1. Stellar can load objects into ‘caches’ from some repository and refer to them. 2. The repositories 3. Some number of strategies to populate and

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
Oh, one more thing, while the example here is around typosquatting, this is of use outside of that. Pretty much any large existence-style query can be enabled via this construction (create a summary bloom filter). There are other use-cases involving other data structures too. On Sun, Dec 24,

Re: [DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
I'll start this discussion off with my idea around a 2nd step that is more adaptable. I propose the following set of stellar functions backed by Spark in the metron-management project: - CSV_PARSE(location, separator?, columns?) : Constructs a Spark Dataframe for reading the flatfile -

[DISCUSS] Generating and Interacting with serialized summary objects

2017-12-24 Thread Casey Stella
Hi all, I wanted to get some feedback on a sensible plan for something. It occurred to me the other day when considering the use-case of detecting typosquatted domains, that one approach was to generate the set of typosquatted domains for some set of reference domains and compare domains as they
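The approach outlined in this message can be sketched as follows. This is a simplified illustration only: the three edit strategies (omission, replacement, adjacent swap) and all function names are assumptions, and an exact Python set stands in for the summary structure that would hold the generated names at scale.

```python
import string
from typing import Iterable, Set


def typosquat_candidates(name: str) -> Set[str]:
    """Generate simple typo variants of a name (illustrative strategies only)."""
    candidates: Set[str] = set()
    for i in range(len(name)):
        # Omission: drop the character at position i
        candidates.add(name[:i] + name[i + 1:])
        # Replacement: substitute every lowercase letter at position i
        for c in string.ascii_lowercase:
            candidates.add(name[:i] + c + name[i + 1:])
    for i in range(len(name) - 1):
        # Adjacent swap: exchange characters i and i + 1
        candidates.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    candidates.discard(name)
    candidates.discard("")
    return candidates


def build_reference_set(reference_names: Iterable[str]) -> Set[str]:
    """Union of variants for all reference names, minus the references themselves."""
    references = set(reference_names)
    squatted: Set[str] = set()
    for name in references:
        squatted |= typosquat_candidates(name)
    return squatted - references


def is_possible_typosquat(name: str, squatted: Set[str]) -> bool:
    """Compare an incoming name against the precomputed variants."""
    return name in squatted
```

For example, building the set from `["google", "github"]` flags `gogle` and `googel` while leaving `google` itself unflagged.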