Any volunteers for creating a set of jiras and feature branch for an object
store repository? This sounds like a massive feature.
On Jan 5, 2018 2:06 PM, "Otto Fowler" wrote:
> I would say that at the stellar author level, you would just get objects
> from the store and
Yes, abstracted.
We have an api of stellar functions that just load things from the store,
they don’t need to bleed through what the store is.
We have a ‘store’, which may be hdfs or hbase or whatever.
We have an api for adding to the store (add, etc.) that doesn’t
presume the store either.
Then
I would say that at the stellar author level, you would just get objects
from the store and the ‘override’ case would be a follow on for edge cases.
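To make the abstraction concrete, here is a rough sketch of the kind of backend-agnostic store interface being described. All names here are hypothetical illustrations, not the actual Stellar or Metron API:

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Hypothetical backend-agnostic store; concrete
    implementations could be HDFS, HBase, or whatever."""

    @abstractmethod
    def get(self, key):
        """Fetch an object by key, or None if absent."""

    @abstractmethod
    def put(self, key, obj):
        """Add an object to the store under the given key."""

class InMemoryStore(ObjectStore):
    """Toy in-memory implementation, for illustration only."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, obj):
        self._data[key] = obj

# At the stellar-author level you just get objects from "the store";
# which backend sits behind the interface never bleeds through.
store = InMemoryStore()
store.put("typosquat_filter", {"kind": "bloom"})
print(store.get("typosquat_filter")["kind"])  # bloom
```

The point of the sketch is only that the get/add api is defined against the interface, so swapping HDFS for HBase is a configuration change, not a stellar-function change.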
On January 5, 2018 at 14:29:16, Casey Stella (ceste...@gmail.com) wrote:
Well, you can pull the default configs from global configs, but you might
want to override them (similar to the profiler). For instance, you might
want to interact with another hbase table than the one globally configured.
On Fri, Jan 5, 2018 at 12:04 PM, Otto Fowler
I'm not sure I follow what you're saying as it pertains to summary objects.
Repository is a loaded term, and I'm very apprehensive of pushing for
something potentially very complex where a simpler solution would suffice
in the short term. To wit, the items I'm seeing in this use case doc -
I would imagine the ‘stellar-object-repo’ would be part of the global
configuration or configuration passed to the command.
why specify in the function itself?
On January 5, 2018 at 11:22:32, Casey Stella (ceste...@gmail.com) wrote:
I like that, specifically the repositories abstraction. Perhaps we can
construct some longer term JIRAs for extensions.
For the current state of affairs (wrt to the OBJECT_GET call) I was
imagining the simple default HDFS solution as a first cut and
following on adding a repository name (e.g.
If we separate the concerns as I have stated previously:
1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and possibly update the
repository, from spark,
to MR jobs to whatever you would classify the
I agree with the general sentiment that we can tailor specific use cases
via UI, and I'm worried that the use case specific solution (particularly
in light of the note that it's not even general to the class of bloom
filter problems, let alone an actually general problem) becomes more work
than
I mentioned this earlier, but I'll reiterate that I think this approach
gives us the ability to make specific use cases via a UI, or other
interface should we choose to add one, while keeping the core adaptable and
flexible. This is ideal for middle tier as I think this effectively gives
us the
It also occurs to me that even in this situation, it's not a sufficient
generalization for just Bloom, but this is a bloom filter of the output of
all the typosquatted domains for the domain in each row. If we wanted
to hard code, we'd have to hard code specifically the bloom filter *for*
So, there is value outside of just bloom usage. The most specific example
of this would be in order to configure a bloom filter, we need to know at
least an upper bound of the number of items that are going to be added to
the bloom filter. In order to do that, we need to count the number of
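The dependency on the item count comes from the standard bloom filter sizing formulas: for n expected items and target false-positive rate p, the bit-array size is m = -n·ln(p)/(ln 2)² and the hash count is k = (m/n)·ln 2. A small sketch (the Alexa-scale numbers below are illustrative, not measured):

```python
import math

def bloom_parameters(n, p):
    """Standard bloom filter sizing.
    n: upper bound on items to be added; p: target false-positive rate.
    Returns (m bits in the filter, k hash functions)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# e.g. the Alexa 1M with, say, ~500 typosquatted variants per domain:
n = 1_000_000 * 500
m, k = bloom_parameters(n, 0.01)
print(m, k)  # roughly 4.8e9 bits (~600 MB) and 7 hash functions
```

This is exactly why the counting pass has to happen before the filter can even be constructed: overestimate n and you waste memory, underestimate it and the false-positive rate blows past p.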
I just went through these pull requests as well and also agree this is good
work. I think it's a good first pass. I would be careful with trying to boil
the ocean here. I think for the initial use case I would only support loading
the bloom filters from HDFS. If people want to pre-process
I just finished stepping through the typosquatting use case README in your
merge branch. This is really, really good work Casey. I see most of our
previous documentation issues addressed up front, e.g. special variables
are cited, all new fields explained, side effects documented. The use case
doc
I'm liking this design and growth strategy, Casey. I also think Nick and
Otto have some valid points. I always find there's a natural tension
between too little, just enough, and boiling the ocean, and these discussion
threads really help drive what the short and long term visions should look
like.
There is some really cool stuff happening here, if only I’d been allowed to see
the lists over Christmas... :)
A few thoughts...
I like Otto’s generalisation of the problem to include specific local stellar
objects in a cache loaded from a store (HDFS seems a natural, but not only
place,
Oh, gotcha. That makes sense. Thanks for clarifying.
On Wed, Jan 3, 2018 at 12:15 PM, Casey Stella wrote:
> It's actually many more than 1M. There are 1M domains, each domain could
> have upwards of 300 - 1000 possible typosquatted domains.
>
> You will notice from
>
It's actually many more than 1M. There are 1M domains, each domain could
have upwards of 300 - 1000 possible typosquatted domains.
You will notice from
https://github.com/cestella/incubator-metron/tree/typosquat_merge/use-cases/typosquat_detection#generate-the-bloom-filter
that we are not adding
> Even with 5 threads, it takes an hour for the full Alexa 1m, so I think
this will impact performance
What exactly takes an hour? Adding 1M entries to a bloom filter? That
seems really high, unless I am not understanding something.
On Wed, Jan 3, 2018 at 10:17 AM, Casey Stella
Thanks for the feedback, Nick.
Regarding "IMHO, I'd rather not reinvent the wheel for text manipulation."
I would argue that we are not reinventing the wheel for text manipulation
as the extractor config exists already and we are doing a similar thing in
the flatfile loader (in fact, the code is
First off, I really do like the typosquatting use case and a lot of what
you have described.
> We need a way to generate the summary sketches from flat data for this to
> work.
> ..
>
I took this quote directly from your use case. Above is the point that I'd
like to discuss and what your
1st. You are not the only one looking at the list on Dec 24th Casey, so don’t
feel bad.
2nd. Maybe we can separate this into 2 areas of concern.
1. Stellar can load objects into ‘caches’ from some repository and refer to
them.
2. The repositories
3. Some number of strategies to populate and
Oh, one more thing, while the example here is around typosquatting, this is
of use outside of that. Pretty much any large existence-style query can be
enabled via this construction (create a summary bloom filter). There are
other use-cases involving other data structures too.
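A minimal illustration of the "summary object" pattern for existence-style queries follows. This toy filter is for illustration only; Metron would use its own serializable sketch implementation:

```python
import hashlib

class TinyBloom:
    """Toy bloom filter: illustration of summarize-once, query-cheaply."""
    def __init__(self, m_bits=1 << 20, k=5):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting a hash with the index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False positives possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Summarize the generated typosquats once, offline...
bloom = TinyBloom()
for domain in ("examp1e.com", "exarnple.com"):
    bloom.add(domain)

# ...then check streaming domains cheaply at enrichment time.
print(bloom.might_contain("examp1e.com"))  # True
print(bloom.might_contain("google.com"))   # almost certainly False
```

The same pattern applies to other sketches (HyperLogLog for cardinality, count-min for frequency): build the summary object offline, store it, and let stellar pull it from the cache for per-message queries.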
On Sun, Dec 24,
I'll start this discussion off with my idea around a 2nd step that is more
adaptable. I propose the following set of stellar functions backed by
Spark in the metron-management project:
- CSV_PARSE(location, separator?, columns?) : Constructs a Spark
Dataframe for reading the flatfile
-
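The semantics of the proposed CSV_PARSE(location, separator?, columns?) could be pinned down, minus Spark, with a plain-Python stand-in. This is a sketch to clarify intent, not the proposed implementation, which would return a Spark Dataframe:

```python
import csv
import io

def csv_parse(source, separator=",", columns=None):
    """Illustrative stand-in for CSV_PARSE(location, separator?, columns?).
    Returns a list of dicts (the real proposal: a Spark Dataframe)."""
    reader = csv.reader(source, delimiter=separator)
    rows = list(reader)
    if columns is None:
        # Assume a header row when columns are not supplied.
        columns, rows = rows[0], rows[1:]
    return [dict(zip(columns, row)) for row in rows]

# e.g. parsing a tab-separated flatfile of ranked domains:
flatfile = io.StringIO("google.com\t1\nfacebook.com\t2\n")
rows = csv_parse(flatfile, separator="\t", columns=["domain", "rank"])
print(rows[0]["domain"])  # google.com
```

Making separator and columns optional matches the extractor-config conventions already used by the flatfile loader.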
Hi all,
I wanted to get some feedback on a sensible plan for something. It
occurred to me the other day when considering the use-case of detecting
typosquatted domains, that one approach was to generate the set of
typosquatted domains for some set of reference domains and compare domains
as they