Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Hi Vijay, Please ignore parts of my previous email. The solution is a bit more complicated. Of the three metrics only the Adspend is truly additive. Summing the category fields makes no sense. This means you have to design the implementation of the SummarySetOperations class so that it makes

Re: Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Vijay, Sorry about the delay in getting back to you. There is some critical information missing from your description and that is the domain of what you are sketching. I presume that it is User-IDs, otherwise it doesn't make sense. If this is the case I think the solution can be achieved in a

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
I'd have to think about it more. But the FDT sketch was put in the library as an example. With tuple sketches you would have to write the code that encapsulates the tuple summary cells to do what you want and then extend the summary aggregator to do the proper merge operations. So in a sense

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
Not directly. But the FDT sketch is really pretty simple to code yourself, and is in the library primarily as an example. Nonetheless, one of the reasons that only a few of our sketches have been adapted for Druid is that Druid requires that all sketches be capable of operating off-heap. Which is

Re: Ad impression counting and unique users counting using data sketches

2021-09-16 Thread leerho
sample of the raw records(random uniform sampling) and >> then extrapolate the query results on the sample but multiplying it with 20 >> >> I would like to note that the above 2 queries are only the initial set of >> queries that I found interesting and probably there would be

Re: Ad impression counting and unique users counting using data sketches

2021-09-15 Thread leerho
Hi Karik, The problem you describe is typical for on-line advertising and similar to ones we have worked on before. Solving this problem with sketches will provide approximate results in near-real time. However, doing so even with sketches may require considerably more resources than you may be

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-26 Thread leerho
Hi Karl, I just want to explain the reasons you cannot create an UpdateSketch directly from a CompactSketch: The CompactSketch is by definition immutable and has the smallest footprint and simplest structure. It is produced as the result of all of the set operations because the set operations

[NOTICE] URL's to our Repositories will be changing

2020-12-18 Thread leerho
Folks, Now that we have been approved for graduation by the ASF Board, the URLs to some of our assets will be changing as we transition to a Top-Level Project (TLP). For example: - GitHub Repositories, for example: https://github.com/apache/incubator-datasketches-java will become

Re: [E] Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
On Thu, Nov 19, 2020 at 9:57 AM leerho wrote: > >> Hi Justin, the site you referenced returns an error 500 (internal server >> error). It might be down, or out-of-service. You might also check

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
>>> implemented in the library is implicitly performing a type of downsampling >>> internally and then summarizing the sample (this is a little bit of a >>> simplification). >>> >>> Something similar is true for frequent items. However, it is not true >>> for "

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-18 Thread leerho
Sorry, if you presample your data all bets are off in terms of accuracy. On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro wrote: > Hi, I am new to DataSketches. > > I know Datasketches provides an *approximate* calculation of statistics > with *mathematically proven error bounds*. > > My
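The warning about presampling can be illustrated with a small simulation. This is a minimal sketch under assumed data, not anything taken from the thread: the record layout (1,000 hypothetical users with 20 records each), the 5% sampling rate, and the function name `scaled_distinct_estimate` are all made up for illustration. It shows that counting distinct users on a uniform random sample of records and then multiplying by the inverse sampling rate does not recover the true distinct count.

```python
import random


def scaled_distinct_estimate(records, rate, seed=0):
    """Count distinct items in a uniform random sample of the records,
    then scale the count by 1/rate (the naive extrapolation)."""
    rng = random.Random(seed)
    sample_size = int(len(records) * rate)
    sample = rng.sample(records, sample_size)  # sampling without replacement
    return len(set(sample)) / rate


# Hypothetical data: 1,000 users, each appearing in 20 records.
records = [user for user in range(1000) for _ in range(20)]

true_distinct = len(set(records))
estimate = scaled_distinct_estimate(records, rate=0.05)

# The scaled estimate lands far above the true distinct count,
# because most users still show up at least once in the sample.
print(true_distinct, round(estimate))
```

Scaling works for additive quantities such as record counts, but a distinct count is not additive across records, which is one way to read "all bets are off" here.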

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :) On Fri, Aug 14, 2020 at 4:06 PM leerho wrote: > The other option would be to deprecate the Hive SketchState update(...) > method and create a "newUpdate(...)" method that

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
The other option would be to deprecate the Hive SketchState update(...) method and create a "newUpdate(...)" method that has strings encoded with UTF-8. And also document the reason why. Any other ideas? On Fri, Aug 14, 2020 at 4:03 PM leerho wrote: > Yep! It turns out that there is
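The snippet above is truncated, so the root cause discussed in the thread is not shown here. As a general illustration of why the string encoding matters whenever items are hashed before being counted, the sketch below hashes the same identifier from its UTF-8 bytes and from its UTF-16 code units (the in-memory layout of a Java char[]). The identifier `user-42`, the helper `item_hash`, and the use of SHA-256 are assumptions for illustration only; this is not the hash function or API of the DataSketches library.

```python
import hashlib


def item_hash(data: bytes) -> str:
    """Stand-in hash for illustration; not the library's hash function."""
    return hashlib.sha256(data).hexdigest()


user_id = "user-42"  # hypothetical item

as_utf8 = user_id.encode("utf-8")       # one byte per ASCII character
as_utf16 = user_id.encode("utf-16-le")  # two bytes per character, like char[]

# A distinct counter that receives both encodings of the same logical
# item sees two different hashes and counts the item twice.
seen = {item_hash(as_utf8), item_hash(as_utf16)}
print(len(seen))  # prints 2
```

If two systems feed the same identifiers into sketches using different encodings, merging those sketches can overcount, so whichever encoding is chosen has to be applied consistently on every update path.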

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
the Kafka Streams app to char[] will be a good first step. > > I'll give that a try and report back. > > Thanks everyone for your help in finding the source of this! > > Kind regards, > Marko > > On Fri, 14 Aug 2020 at 20:58, leerho wrote: > >> Hi Marko, >>

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
e care of local times, etc..., these should be the correct >> values with excluded days: >> Without first day: 24890 >> Without first and second day: 22989 >> >> Thanks, >> Marko >> >> >> On Fri, 14 Aug 2020 at 17:08, leerho wrote: >> >>>

Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, I notice that the first two sketches are the result of union operations, while the remaining sketches are pure streaming sketches. Could you perform Jon's request again except excluding the first two sketches? Just to cover the bases, could you explain the types of the data items that

Re: HLL Union and lgK config

2020-08-13 Thread leerho
Marko, We are working to understand this problem. Thank you for sending us the actual sketches. That helps us a great deal! Cheers, Lee. On Thu, Aug 13, 2020 at 3:24 PM Jon Malkin wrote: > Hi Marko, > > Could you please let us know two more things: > 1) Which is the one particular sketch

Re: Support for "advanced" SQL types (in HLL)

2020-07-03 Thread leerho
Csaba, These are some very thoughtful suggestions and I can see that some recommendations in this area would be useful. Our focus in our DataSketches team is really on the sketching algorithms and designing the core sketches to be very high performing, robust, accurate, and easy to integrate

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
v < > sayda...@verizonmedia.com> wrote: > >> Adding the original poster just in case he is not subscribed to the list >> >> On Mon, Jun 22, 2020 at 7:18 PM leerho wrote: >> >>> I see a typo: What I called the Omega relation is actually Omicron (big >>>

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
read this over at some point and double-check both of > our work :-) > > On Mon, Jun 22, 2020 at 9:14 PM leerho wrote: > >> Hello Gourav, welcome to this forum! >> >> I want to make sure you have access to and have read the code >> documentation for the K

Re: Regarding error bounds and confidence of apache KLL implementation

2020-06-22 Thread leerho
Hello Gourav, welcome to this forum! I want to make sure you have access to and have read the code documentation for the KLL sketch in addition to the papers. Although the code documentation exists for both Java and C++, it is a little easier to access the Javadocs as they are accessible from

Re: Tuple sketches question

2020-05-20 Thread leerho
Hi David, Thank you for reaching out to us. We are always interested in learning about new users and new uses of the library, especially with Tuple sketches, which we do not hear much feedback about. Let me try to address some of your questions: The Tuple Sketch is an "extension" of the Theta

Re: Public slack invitation

2020-05-20 Thread leerho
There is something wrong with that link. Meanwhile I have added your email & name on your behalf for the #datasketches channel on the-asf.slack.com workspace. Lee. On Wed, May 20, 2020 at 2:50 AM David Cromberge < david.crombe...@permutive.com> wrote: > Hello, > > I would like to join the

Re: Apache Impala integration with DataSketches HLL (C++)

2020-04-27 Thread leerho
Hi Gabor, My quick question would be that taking into account that the order of the > items provided to datasketches:hll_sketch is not deterministic is it normal > behaviour that for the same dataset I get a different estimate each time I > run my query? > I'm trying to figure out if this is due

Re: Why are so many of the classes in org.apache.datasketches.cpc final?

2020-04-25 Thread leerho
nt. > > Ron > > On Apr 24, 2020, at 3:12 PM, leerho wrote: > > Hi Ron, > > Our mission is to develop a robust sketch library *product* that can be > used in production systems in many different environments and be high > performing and binary compatible across langu