Participate in the ASF 25th Anniversary Campaign

2024-04-03 Thread Brian Proffitt
Hi everyone, As part of The ASF’s 25th anniversary campaign[1], we will be celebrating projects and communities in multiple ways. We invite all projects and contributors to participate in the following ways: * Individuals - submit your first contribution:

[ANNOUNCE] Apache DataSketches-java 5.0.2 Release

2024-04-02 Thread Jon Malkin
Dear Apache DataSketches community, The Apache DataSketches team is proud to announce the release of Apache DataSketches Java 5.0.2. This is the core Java component of the DataSketches library that includes all the sketch algorithms in production-ready packages. These sketches can be called

Community Over Code NA 2024 Travel Assistance Applications now open!

2024-03-27 Thread Gavin McDonald
Hello to all users, contributors and Committers! [ You are receiving this email as a subscriber to one or more ASF project dev or user mailing lists; it is not being sent to you directly. It is important that we reach all of our users and contributors/committers so that they may get a chance

[ANNOUNCE] Apache DataSketches-python 5.0.1 Released

2024-03-22 Thread Jon Malkin
Hello Apache DataSketches community, We are pleased to have released DataSketches-python 5.0.1. (This happened in early February but I realized I never sent an announcement email.) This is the first stand-alone python release and comes with a number of changes: * Vastly improved API

Guarantees on binary format compatibility

2024-02-28 Thread Zachary Blanco via users
Hi, I couldn’t find any information in the documentation on the guarantees about the binary format compatibility for any of the sketches in the library. For example: Given a KLL sketch generated by v4.2.0 version of the library and serialized with the Java API’s `toByteArray()` method we could

Community Over Code Asia 2024 Travel Assistance Applications now open!

2024-02-20 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code Asia 2024 are now open! We will be supporting Community over Code Asia, Hangzhou, China July 26th - 28th, 2024. TAC exists

Community over Code EU 2024 Travel Assistance Applications now open!

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

[no subject]

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

[ANNOUNCE] Apache DataSketches C++ library v5.0.1 released

2024-01-02 Thread Jon Malkin
Hello all, We are pleased to announce the release of datasketches-cpp-5.0.1. This is a patch release to fix a couple bugs, although we received no error reports from the wild around either. The full release notes (including for the v5.0.0 release) are available at:

How to access summary info of a TupleSketch in Python

2023-12-13 Thread Zaatri, Adhham via users
Hi everyone, I’m trying to work with datasketches in Python. I’m having a bit of trouble understanding the interface for TupleSketch. I created a custom policy which updates a counter, and created an update_tuple_sketch like so: ``` from _datasketches import TuplePolicy import datasketches

[ANNOUNCE] Apache DataSketches C++ core library version 5.0.0 released

2023-11-13 Thread Alexander Saydakov via users
Hi Apache DataSketches community, We are pleased to announce the release of datasketches-cpp-5.0.0 The release notes are available here: https://github.com/apache/datasketches-cpp/releases/tag/5.0.0

Registration open for Community Over Code North America

2023-08-28 Thread Rich Bowen
Hello! Registration is still open for the upcoming Community Over Code NA event in Halifax, NS! We invite you to register for the event https://communityovercode.org/registration/ Apache Committers, note that you have a special discounted rate for the conference at US$250. To take advantage of

TAC Applications for Community Over Code North America and Asia now open

2023-06-16 Thread Gavin McDonald
Hi All, (This email goes out to all our user and dev project mailing lists, so you may receive this email more than once.) The Travel Assistance Committee has opened up applications to help get people to the following events: *Community Over Code Asia 2023 - * *August 18th to August 20th in

Re: Building datasketches-postgresql with nix

2023-06-02 Thread Marko Mušnjak
Hi, Now there's a package with the datasketches postgres extension available in Nix packages, so any NixOS or Nix package manager user can easily install a postgres instance with the datasketches extension. For example, to set up a shell with pg and the extension, use the following in your

Building datasketches-postgresql with nix

2023-05-25 Thread Marko Mušnjak
Hi, I was playing around learning Nix and chose a little project for myself: build a postgres package with some extensions, including datasketches-postgresql. Since that took a lot of tinkering, I thought I'd share this here in case anyone ever needs it. The cause of the struggle seems to be in

[ANNOUNCE] Apache DataSketches PostgreSQL extension version 1.6.0 released

2023-05-15 Thread Alexander Saydakov via users
Hi Apache DataSketches community, We are pleased to announce the release of datasketches-postgresql-1.6.0 The release notes are available here: https://github.com/apache/datasketches-postgresql/releases/tag/1.6.0

[ANNOUNCE] Apache DataSketches C++ core library version 4.1.0 released

2023-05-03 Thread Alexander Saydakov via users
Hi Apache DataSketches community, We are pleased to announce the release of datasketches-cpp-4.1.0 The release notes are available here: https://github.com/apache/datasketches-cpp/releases/tag/4.1.0

Using Theta sketches in Data science classic algorithms

2023-01-31 Thread vijay rajan
Hi Alex, Lee and Theta sketches team, Most of the discussion in this forum has been on sketches, druid, java, PIG latin, Hadoop, Hive and so on. I would like to know if there is a forum for research and open source work on applications of Theta Sketches. Based on the research site [

aggregateByKey multiple theta sketches for countries example in Pyspark

2023-01-18 Thread Eleanor Wong
Hi, I’m trying to implement in Pyspark the Theta sketch multiple sketches for countries example in https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html. I am stuck on the aggregateByKey, after implementing a ThetaSketchSerializable class and the Add and Combine functions. Is

Re: [E] Re: How to get 'sum' for update_theta_sketch on DataSketches Python API

2023-01-17 Thread Jon Malkin
Wow, I just realized I misstated things rather significantly. We very much have tuple sketches in C++, but the python wrapper for them is a work in progress. I thought I had it ready, but it turns out there are some pretty significant limitations with the wrapper we're using (pybind11) that I now

Re: [E] Re: How to get 'sum' for update_theta_sketch on DataSketches Python API

2023-01-17 Thread Alexander Saydakov via users
Yes, Druid does this on top of the specialized Tuple sketch called ArrayOfDoublesSketch (in Java). Each key in the sketch has an array of floating-point values associated with it. PostAggregator functions can convert these columns into means and variances using

Re: How to get 'sum' for update_theta_sketch on DataSketches Python API

2023-01-16 Thread Tomer B
Thanks, yeah! (Tuple sketch and not theta, as you said!) I have another question, please. I looked at the tuple sketch at: https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/tuple/aninteger/IntegerSummary.Mode.html and I see the possible values of mode are: Sum,

Re: How to get 'sum' for update_theta_sketch on DataSketches Python API

2022-12-31 Thread Jon Malkin
I believe you're looking at the tuple sketch code in java, not theta sketch. We don't yet have tuple support in C++ (on which python is based). It's planned, but I haven't yet had time to sit down and figure out how to do it -- and specifically how to do so with a reasonable API. jon

How to get 'sum' for update_theta_sketch on DataSketches Python API

2022-12-30 Thread Tomer B
In the Python API I can do 'update_theta_sketch' and then get_estimate to get the unique count. But how can I get the sum in Python for update_theta_sketch? I see it's available in the non-Python datasketches here: public static enum Mode --> /** * The aggregation mode is the summation function.

Announcement: Apache DataSketches C++ core library version 4.0.0 released

2022-12-06 Thread Alexander Saydakov via users
Hi Apache DataSketches community, We are pleased to announce the release of datasketches-cpp-4.0.0 The release notes are available here: https://github.com/apache/datasketches-cpp/releases/tag/4.0.0 Alexander Saydakov al...@apache.org

DataSketches-cpp 3.5.1 Released!

2022-11-10 Thread Jon Malkin
Hi DS Community, We are pleased to announce the release of datasketches-cpp 3.5.1. This is a patch release that addresses a couple of issues. 1. The Python wheels on PyPI should now work on Apple Silicon Macs 2. We fixed a serialization bug in theta and tuple sketches when the sketch had no

Re: fdt sketch

2022-09-26 Thread Jon Malkin
Apologies for the delayed response. The short answer is that the sketch only exists in the Java library, and using that one the provided example should work fine. That said, I think it should be possible to implement it in C++ with the correct definitions of ExtractKey, Summary, and Policy. In

fdt sketch

2022-09-08 Thread Yaron Illouz
Hi, I can’t find the FDT sketch in the https://github.com/apache/datasketches-cpp code as described in https://datasketches.apache.org/docs/Frequency/FrequentDistinctTuplesSketch.html I believe it is implemented because I can see an example in the description //Construct the sketch FdtSketch sketch =

[ANNOUNCE] DataSketches C++/Python 3.4.0 released!

2022-05-26 Thread Jon Malkin
Hello all, We are pleased to announce the release of DataSketches C++/Python 3.4.0. This component provides a header-only C++ library, and also includes a thin wrapper to enable sketches within python. Summary of changes in this version: - addition of Quantiles sketch: the algorithm is

Re: Questions on using Theta Sketches

2022-05-25 Thread Karl Matthias
Hi Kevin, Inserting all the `visit_id`s into a ThetaSketch by day will give you a distinct "set" for the day. You can then union those across the range on demand, and get a distinct over the arbitrary date range. The one caution I would make here is that unioning a very large set of sketches, or
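
The per-day sketch and on-demand union pattern described above can be illustrated with a toy example. This is a minimal sketch, assuming a simple K-Minimum-Values (KMV) estimator as a stand-in for a real Theta sketch; the class `ToyKmvSketch`, its methods, the helper `distinct_over`, and the sample dates and visit IDs are all hypothetical and are not the DataSketches API.

```python
import hashlib


class ToyKmvSketch:
    """Toy K-Minimum-Values distinct counter (illustrative stand-in, not the DataSketches API)."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()  # holds at most the k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        self.hashes.add(self._hash(item))
        if len(self.hashes) > self.k:
            self.hashes.remove(max(self.hashes))  # keep only the k smallest

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k hashes retained: exact count
        return (self.k - 1) / max(self.hashes)

    def union(self, other):
        merged = ToyKmvSketch(min(self.k, other.k))
        merged.hashes = set(sorted(self.hashes | other.hashes)[: merged.k])
        return merged


# One sketch per day; the ID ranges overlap on purpose (7000 distinct IDs overall).
visits_by_day = {
    "2022-05-21": range(0, 3000),
    "2022-05-22": range(2000, 5000),
    "2022-05-23": range(4000, 7000),
}
daily = {}
for day, visit_ids in visits_by_day.items():
    daily[day] = ToyKmvSketch()
    for visit_id in visit_ids:
        daily[day].update(visit_id)


def distinct_over(days):
    """Union the stored daily sketches on demand and estimate the distinct count."""
    result = ToyKmvSketch()
    for day in days:
        result = result.union(daily[day])
    return result.estimate()


print(distinct_over(["2022-05-21", "2022-05-22", "2022-05-23"]))
```

Only the small daily sketches are stored; any date range is answered by unioning them at query time, without revisiting the raw `visit_id` rows.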

Questions on using Theta Sketches

2022-05-23 Thread Kevin Peng
Hi All, I am pretty new to the community and I am trying to get my head wrapped around the usage of the theta sketch python library to compute approx distinct counts. Here is my use case: - I have the following table structure: visit_id, dimension (array), date (Single GMT day i.e.

Re: [E] Re: Using Quantile sketches for additive metrics

2022-05-23 Thread Alexander Saydakov
This seems to be a convoluted way of computing the sum of all values. This is an additive metric, easy to compute exactly, no sketches needed. On Sun, May 22, 2022 at 5:56 AM vijay rajan wrote: > Hi folks (And Lee), > > I think I have found what I was looking for in quantile sketches though I >

Re: [E] Re: Using Quantile sketches for additive metrics

2022-05-22 Thread vijay rajan
Hi folks (And Lee), I think I have found what I was looking for in quantile sketches though I am not able to formulate error bounds for the same. I should have raised a PR request but I am going to write the code here. The code below estimates the volume of the quantile sketch based on the

Re: [E] Re: Using Quantile sketches for additive metrics

2022-05-03 Thread vijay rajan
Thanks Will. Please find my reply in-line below. But just to stay in line with my original question of a sketch for additive metrics, is that I can use such a sketch for on-the-fly aggregation by storing one such sketch per "dimension=value" pair without having to go to the table for aggregation.

Re: [E] Re: Using Quantile sketches for additive metrics

2022-05-02 Thread vijay rajan
Hi Will, Thanks for your response. I will send my clarifications in a day or two. Please do look at my detailed explanation & look at the datasets and results that I have shared. You should understand what I am trying to do. Essentially, an event_Id is a uuid for an event. A click stream will

Re: [E] Re: Using Quantile sketches for additive metrics

2022-05-02 Thread Will Lauer
OK, this is interesting. I've got some concerns and questions that I've put inline below. Will On Mon,

Re: Using Quantile sketches for additive metrics

2022-05-02 Thread vijay rajan
Thanks Lee. Please find my answers inline below in blue. I think you will find my use case very interesting. My next endeavor would be to make a decision tree with entropy / gini impurity measures with sketches. I am amazed at some of the results I have gotten. You may find this quite interesting.

Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Hi Vijay, Please ignore parts of my previous email. The solution is a bit more complicated. Of the three metrics only the Adspend is truly additive. Summing the category fields makes no sense. This means you have to design the implementation of the SummarySetOperations class so that it makes

Re: Using Quantile sketches for additive metrics

2022-04-30 Thread leerho
Vijay, Sorry about the delay in getting back to you. There is some critical information missing from your description and that is the domain of what you are sketching. I presume that it is User-IDs, otherwise it doesn't make sense. If this is the case I think the solution can be achieved in a

Using Quantile sketches for additive metrics

2022-04-28 Thread vijay rajan
Hi, Just like theta sketches are used for distinct count metrics like impressions and clicks, is there a sketch (perhaps quantile?) that can be used for metrics like ad_spend? If so, what are the error bounds? There is a big opportunity that I see in storing very little data in sketches (which I

Re: Frequent Distinct Tuples Sketch

2022-01-13 Thread Ben Krug
Apologies - I mixed up which user group I was reading. I thought I was in the druid users group, and distracted attention from the original question. Apologies again, Ben On Wed, Jan 12, 2022 at 11:09 PM leerho wrote: > I'd have to think about it more. But the FDT sketch was put in the >

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
I'd have to think about it more. But the FDT sketch was put in the library as an example. With tuple sketches you would have to write the code that encapsulates the tuple summary cells to do what you want and then extend the summary aggregator to do the proper merge operations. So in a sense

Re: Frequent Distinct Tuples Sketch

2022-01-12 Thread leerho
Not directly. But the FDT sketch is really pretty simple to code yourself, and is in the library as primarily an example. Nonetheless, one of the reasons that only a few of our sketches have been adapted for Druid is that Druid requires that all sketches be capable of operating off-heap. Which is

Re: Frequent Distinct Tuples Sketch

2022-01-07 Thread Ben Krug
Does druid support FDT sketches? The datasketch module docs don't list it. https://druid.apache.org/docs/latest/development/extensions-core/datasketches-extension.html On Fri, Jan 7, 2022 at 1:04 AM liupeng_wx wrote: > hi all: > > i have a question at Frequent Distinct Tuples Sketch. a

Frequent Distinct Tuples Sketch

2022-01-07 Thread liupeng_wx
Hi all, I have a question about the Frequent Distinct Tuples Sketch. Given a multiset of tuples with N dimensions {d1, d2, d3, …, dN}, FDT can be based on any of the dimensions and approximately count the distinct values of the remaining dimensions. E.g.: select approximate group by (d1, d2), count distinct {d2, ... dn} from sketches group

Re: Ad impression counting and unique users counting using data sketches

2021-09-16 Thread Kartik Mahajan
Thanks for your inputs, Karl and Lee. Regards kartik On Fri, Sep 17, 2021 at 6:15 AM leerho wrote: > Kartik, >> >> *Do you think this is a good model to solve Q2?* > > Your Q2 is in the domain of unique users. So, Yes. And, if you are using > Druid to do effectively a "select and group-by"

Re: Ad impression counting and unique users counting using data sketches

2021-09-16 Thread leerho
Kartik, > > *Do you think this is a good model to solve Q2?* Your Q2 is in the domain of unique users. So, Yes. And, if you are using Druid to do effectively a "select and group-by" of the raw data used to feed the two sketches, then just using Theta Sketches is sufficient. The Tuple Sketches

Re: Ad impression counting and unique users counting using data sketches

2021-09-16 Thread Karl Matthias
Hi Kartik, I certainly don't have the expertise with this that Lee does, but stepping back from your specific examples, to use a Theta sketch: 1. All of the sets/sketches you want to have interact together must contain the same domain values, be that User-ID or Impression-ID or

Re: Ad impression counting and unique users counting using data sketches

2021-09-15 Thread leerho
Hi Kartik, The problem you describe is typical for on-line advertising and similar to ones we have worked on before. Solving this problem with sketches will provide approximate results in near-real time. However, doing so even with sketches may require considerably more resources than you may be

Ad impression counting and unique users counting using data sketches

2021-09-14 Thread Kartik Mahajan
Hi, We’re currently serving 100 Billion ad impressions per day across our 6 data centers. Out of these, we are serving about 80 Billion ad impressions in the US alone. Each ad impression can have hundreds of attributes (dimensions), e.g. Country, City, Browser, OS, Custom-Parameters from web-page,

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-28 Thread Karl Matthias
Hi Lee, Thanks very much for this. I had missed that the Union supported updates. I had thought I needed to get the result from it first, but that also returns a CompactSketch which your reasoning explains well. Really appreciate both of you guys helping me out. Cheers, Karl On Fri, Aug 27,

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-26 Thread leerho
Hi Karl, I just want to explain the reasons you cannot create an UpdateSketch directly from a CompactSketch: The CompactSketch is by definition immutable and has the smallest footprint and simplest structure. It is produced as the result of all of the set operations because the set operations

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-26 Thread Karl Matthias
Thanks for that. I figured out how to manage it in the Java lib. You need to use a WritableMemory to wrap the byte array and then explicitly instantiate an UpdateSketch with the WritableMemory. This is now working and I'm doing some prototyping. Ideally I could use this from the C++ library as

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-25 Thread Alexander Saydakov
I believe that Java code still has the functionality to serialize and deserialize updatable Theta sketches. You point to a "wrap" operation, which is one of two ways to deserialize: heapify (instantiate an object on heap from a given chunk of bytes, involves copying data) and wrap (directly

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-25 Thread Karl Matthias
Thank you, I will dig around the old source and see if I can find it. AFAICT it was already removed from the Java implementation as well [1]. You can serialize an UpdateSketch but when deserializing they are read-only. I do deeply understand time series data (I was on the team that designed the

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-25 Thread Alexander Saydakov
It is possible, and we used to have serialization and deserialization of updatable Theta sketches. At some point we decided that it is more confusing than useful and might encourage anti-patterns in big systems (such as deserialize-update-serialize sequences on every update). So we removed this

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-25 Thread Karl Matthias
Thanks for the reply. Yes I could do time series sketches, but what I want actually is a summary representation of the current set, which I update over time and eventually replace entirely. It's an evented system and I want to use Theta sketches as a sort of summary. I can rebuild them entirely at

Re: [E] Theta Serialize/Deserialize and then update?

2021-08-25 Thread Alexander Saydakov
Is there a good reason to necessarily update the same sketch you decided to serialize? I would suggest considering that sketch finalized. Perhaps, in your system these sketches would represent different time periods or different categories or something like that. Later on you may want to merge

Theta Serialize/Deserialize and then update?

2021-08-25 Thread Karl Matthias
Hey folks, I am working with both the Java library and the C++ library and the Theta sketch. What I would like to do is update a sketch, save it somewhere (i.e. disk, etc), then reload it later and possibly update it then. The CompactSketch doesn't support updates when an UpdateSketch is

Postgres Performance Question

2021-07-09 Thread Matthew Farkas
Hi, My name is Matt and I'm a data engineer at Spotify. I'm testing out trying Data Sketches with Postgres, and running into some performance issues. I'm seeing merge times much slower than what I'm seeing in the docs here

Re: Vectorizing KLL Sketch

2021-04-12 Thread Gourav Kumar
Hi Edo, Thanks for the reply. I would like to help build a prototype. I will experiment with AVX2 & AVX512 instructions, and see if I get something in experiments. Thank You, Gourav On Mon, 12 Apr 2021 at 23:52, Edo Liberty wrote: > Hi Gourav > This sounds like a very good idea. I’m sure it

Vectorizing KLL Sketch

2021-04-12 Thread Gourav Kumar
Hi All, Hope you all are keeping well in these COVID times. I am using KLL Sketch for my project where I need to compute approx percentile over a stream. I have gone through the paper Streaming Quantiles Algorithms with Small Space and Update Time (arxiv.org)

Re: [External] Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-08 Thread Alex Garland
Thanks Will and Marko I don’t think we need to decrement/ retract values for any reason, and our requirements were we to use Flink SQL would not currently involve the OVER syntax. It seems today like we’ve managed to get DataSketches CPC sketch integrated okay with an aggregate function in

Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-08 Thread Marko Mušnjak
The basic streaming windowed aggregations (in the Java/Scala API, https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#aggregatefunction) don't require the retract method, but it looks like the SQL/Table API requires retract support for aggregate

Re: [E] Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-08 Thread Will Lauer
Last time I looked at the Flink API for implementing aggregators, it looked like it required a "decrement" function to remove entries from the aggregate in addition to the standard "aggregate" function to add entries to the aggregate. The documentation was unclear, but it looked like this was a

Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-08 Thread Alex Garland
Thanks all very much for the responses so far. Definitely useful but I think it might help to narrow focus if I explain a little more context of what we are trying to do. Firstly, we want to emit the profile metrics as a stream (Kafka topic) as well, which I assume would mean we wouldn’t want

Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-06 Thread Marko Mušnjak
Hi, I've implemented jobs using datasketches in Kafka Streams, Flink streaming, and in Spark batch (through the Hive UDFs provided). Things went smoothly in all setups, with the gotcha that hive UDFs represent incoming strings as utf-8 byte arrays (or something like that, i forgot by now), so if

Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-06 Thread Jon Malkin
I'll echo what Ben said -- if a pre-existing solution does what you need, certainly use that. Having said that, I want to revisit frequent directions in light of the work Charlie did on using it for ridge regression. And when I asked internally I was told that Flink is where at least my company

Re: Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-06 Thread Ben Krug
I can't answer about Spark or Flink, but as a druid person, I'll put in a plug for druid for the "if necessary" case. It can ingest from kafka and aggregate and do sketches during ingestion. (It's a whole new ballpark, though, if you're not already using it.) On Tue, Apr 6, 2021 at 9:56 AM Alex

Choice of Flink vs Spark for using DataSketches with streaming data

2021-04-06 Thread Alex Garland
Hi New to DataSketches and looking forward to using, seems like a great library. My team are evaluating it to profile streaming data (in Kafka) in 5-minute windows. The obvious options for stream processing (given experience within our org) would be either Flink or Spark Streaming. Two

Re: [E] Question on running hive HLL UDAF in spark

2020-12-21 Thread Alexander Saydakov
This looks like the following issue: https://github.com/apache/incubator-datasketches-hive/issues/34 You did not mention your version of Spark. The issue must have been addressed in Spark a long time ago. On Mon, Dec 21, 2020 at 10:10 AM Dong Jiang wrote: > Hi, > > Datasketches has out-of-box

Question on running hive HLL UDAF in spark

2020-12-21 Thread Dong Jiang
Hi, Datasketches has out-of-box HLL UDAF in hive, when I tried in spark, I got errors. Can someone explain why it is failing in spark? spark-shell --jars datasketches-memory-1.2.0-incubating.jar,datasketches-hive-1.0.0-incubating.jar,datasketches-java-1.2.0-incubating.jar spark.sql("""create

[NOTICE] URLs to our Repositories will be changing

2020-12-18 Thread leerho
Folks, Now that we have been approved for graduation by the ASF Board, the URLs to some of our assets will be changing as we transition to a Top-Level Project (TLP). For example: - GitHub Repositories, for example: https://github.com/apache/incubator-datasketches-java will become

Re: Tuple sketch question

2020-12-03 Thread Will Lauer
That looks like just what I need. And it doesn't look too hard to port to ArrayOfDouble or to my custom sketch. Will

Re: Tuple sketch question

2020-12-03 Thread Alexander Saydakov
Someone contributed a class called Filter for generic tuple sketches, but I don't think there is an equivalent for ArrayOfDoubles yet. On Thu, Dec 3, 2020 at 7:30 AM Will Lauer wrote: > I'm using tuple sketches (specifically the ArrayOfDoublesTupleSketch) to > do some computations, and as part

Tuple sketch question

2020-12-03 Thread Will Lauer
I'm using tuple sketches (specifically the ArrayOfDoublesTupleSketch) to do some computations, and as part of that, I need to do some set operations. I need to intersect one tuple sketch with a filtered version (filtered by tuple value) of another tuple sketch. The intersect operation support is

Re: [E] Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
Works for me now :) On Thu, Nov 19, 2020 at 9:10 AM Will Lauer wrote: > Lee, That link looks like it's working for me now. Must have been a > temporary server error. > > Will > > > > Will Lauer > > Senior Principal Architect, Audience & Advertising Reporting >

Re: [E] Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread Will Lauer
Lee, That link looks like it's working for me now. Must have been a temporary server error. Will

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread Justin Thaler
Hi Lee, I guess you mean the link to the paper on sketching subsampled data? That's strange, it's working for me. Anyway, here is more information if anyone wants to access it. The paper is entitled "Space-Efficient Estimation of Statistics over Sub-Sampled Streams" by McGregor, Pavan,

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread leerho
Hi Justin, the site you referenced returns an error 500 (internal server error). It might be down, or out-of-service. You might also check to make sure it is the correct URL. Thanks! Lee. On Thu, Nov 19, 2020 at 6:05 AM Justin Thaler wrote: > I think the way to think about this is the

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread Justin Thaler
I think the way to think about this is the following. If you downsample and then sketch, there are two sources of error: sampling error and sketching error. The former refers to how much the answer to your query over the sample deviates from the answer over the original data, while the second
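
The two error sources named above combine through the triangle inequality. As a sketch, write $q$ for the query, $D$ for the original data, $S$ for the sample, and $\hat{q}(S)$ for the answer the sketch returns over the sample. These symbols are notation introduced here, not taken from the thread, and the answers over $S$ are assumed to be already rescaled so that they are comparable with answers over $D$.

```latex
\[
\underbrace{\lvert \hat{q}(S) - q(D) \rvert}_{\text{total error}}
\;\le\;
\underbrace{\lvert \hat{q}(S) - q(S) \rvert}_{\text{sketching error}}
\;+\;
\underbrace{\lvert q(S) - q(D) \rvert}_{\text{sampling error}}
\]
```

A sketch's proven bounds apply only to the first term on the right; the second term depends on how the sample was drawn and on the query.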

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-19 Thread Sergio Castro
Thanks a lot for your answers to my first question, Lee and Justin. Justin, regarding this observation: "*All of that said, the library will not be able to say anything about what errors the user should expect if the data is pre-sampled, because in such a situation there are many factors that are

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-18 Thread Justin Thaler
Lee's response is correct, but I'll elaborate slightly (hopefully this is helpful instead of confusing). There are some queries for which the following is true: if the data sample is uniform from the original (unsampled) data, then accurate answers with respect to the sample are also accurate

Re: Consequences of sampling before analyzing data with DataSketches

2020-11-18 Thread leerho
Sorry, if you presample your data all bets are off in terms of accuracy. On Wed, Nov 18, 2020 at 10:55 AM Sergio Castro wrote: > Hi, I am new to DataSketches. > > I know Datasketches provides an *approximate* calculation of statistics > with *mathematically proven error bounds*. > > My

Consequences of sampling before analyzing data with DataSketches

2020-11-18 Thread Sergio Castro
Hi, I am new to DataSketches. I know DataSketches provides an *approximate* calculation of statistics with *mathematically proven error bounds*. My question is: Say that I am constrained to take a sampling of the original data set before handing it to DataSketches (for example, I cannot take

Re: [E] Re: Memory usage of frequent items datasketches-cpp package

2020-09-16 Thread Alexander Saydakov
You can control the sketch size, which is, in this case of the Frequent Items sketch, the maximum size of the hash table. It is never exceeded. The threshold for purging is at 3/4, I believe. Purging discards approximately half of the entries. So the hash table oscillates between 1/4 and 3/4 of
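
The bounded hash table and purge behavior described above can be mimicked with a toy counter. This is a minimal sketch, assuming a Misra-Gries-style purge that subtracts the median count and drops entries that fall to zero or below; the 3/4 threshold follows the tentative figure in the message, and `ToyFrequentItems` with its fields is a hypothetical name, not the datasketches-cpp implementation, whose purge logic may differ.

```python
import statistics


class ToyFrequentItems:
    """Toy bounded frequency counter (illustrative only, not the library's algorithm)."""

    def __init__(self, max_map_size):
        self.max_map_size = max_map_size
        self.purge_threshold = (3 * max_map_size) // 4  # assumed 3/4 load limit
        self.counts = {}
        self.offset = 0  # total amount subtracted by purges so far

    def update(self, item, weight=1):
        self.counts[item] = self.counts.get(item, 0) + weight
        if len(self.counts) > self.purge_threshold:
            self._purge()

    def _purge(self):
        # Subtract the (low) median count and keep only strictly larger entries,
        # which discards at least half of the table.
        median = statistics.median_low(self.counts.values())
        self.offset += median
        self.counts = {k: c - median for k, c in self.counts.items() if c > median}


sketch = ToyFrequentItems(max_map_size=64)
peak = 0
for i in range(5000):
    sketch.update("hot")            # one heavy hitter
    sketch.update(f"unique-{i}")    # a long tail of one-off items
    peak = max(peak, len(sketch.counts))

print(peak, sketch.counts["hot"], sketch.offset)
```

In this toy, the table size stays at or below the purge threshold no matter how many distinct items arrive, and the stored count of a surviving item plus `offset` gives an upper bound on its true count.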

Re: [E] Re: Memory usage of frequent items datasketches-cpp package

2020-09-16 Thread Andy Dang
We're running the package in a memory-tight environment and would like to minimize the memory overhead with a hard limit for the process, and were wondering whether that's possible. - Andy On Wed, Sep 16, 2020 at 9:01 AM Alexander Saydakov < sayda...@verizonmedia.com> wrote: > Why would you trigger a

Re: Memory usage of frequent items datasketches-cpp package

2020-09-15 Thread Andy Dang
Scrap this. Coming from the JVM library I embarrassingly misunderstood the size parameter in the Python API (in Java you give the actual size, in Python you give the log 2 of the size). On the other hand, is it possible to trigger a compaction explicitly, or is that not supported? - Andy On
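The size mix-up described in this message comes down to simple arithmetic. Here is a minimal sketch, assuming (as the message states) that the Python API takes the base-2 logarithm of the size; `nominal_size` is a hypothetical helper written for illustration, not a DataSketches function.

```python
def nominal_size(lg_size: int) -> int:
    """Return the size implied by a log-2 size parameter (hypothetical helper)."""
    return 1 << lg_size


# Passing 32 or 64 as a log-2 parameter asks for an enormous structure,
# whereas 5 and 6 give the sizes 32 and 64 that were actually intended.
for lg in (5, 6, 32):
    print(f"lg_size={lg:>2} -> nominal size {nominal_size(lg):,}")
```

A value of 32 interpreted as a log-2 size corresponds to more than four billion entries, which would be consistent with memory growing steadily with the input instead of levelling off.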

Memory usage of frequent items datasketches-cpp package

2020-09-15 Thread Andy Dang
Hi, I was running some benchmarks with the CPP package and I noticed some strange memory behavior. I noticed that the memory seems to increase linearly with the item size when using size 32 or 64. The notebook is at https://suspicious-bassi-380e27.netlify.app/

Re: [E] Re: HLL Union and lgK config

2020-09-14 Thread Marko Mušnjak
Hi, I just wanted to confirm that simply converting the strings to charArray worked fine - the sketches from the hive library merged with the kstreams sketches now produce correct results. Thanks again for the help! On Fri, 14 Aug 2020 at 22:51, Marko Mušnjak wrote: > Hi, > > It does seem the

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :) On Fri, Aug 14, 2020 at 4:06 PM leerho wrote: > The other option would be to deprecate the Hive SketchState update(...) > method and create a "newUpdate(...)" method that has strings

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
The other option would be to deprecate the Hive SketchState update(...) method and create a "newUpdate(...)" method that has strings encoded with UTF-8, and also document the reason why. Any other ideas? On Fri, Aug 14, 2020 at 4:03 PM leerho wrote: > Yep! It turns out that there is already

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Yep! It turns out that there is already an issue on this that was reported 18 days ago. Changing this will be fraught with problems as other Hive users may have a history of sketches created with Strings encoded as char[]. I'm not

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, It does seem the first two days (probably from Spark+Hive UDFs) merged by themselves, closely match the exact count of 11034. The other 12 days (built using Kafka Streams) taken together also closely match the exact count for the period. That would mean we have our cause here. Now to

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, As I stated before the first 2 sketches are the result of union operations, while the rest are not. I get the following: All 14 sketches : 34530 Without the first day : 27501; your count 24890; Error = 10.5% This is already way off. it represents an error of nearly 7 standard
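The 10.5% figure quoted in this message can be reproduced from the two counts it gives. This is a small sketch; `relative_error` is a helper name introduced here for illustration.

```python
def relative_error(estimate: float, exact: float) -> float:
    """Signed relative error of an estimate against the exact value."""
    return (estimate - exact) / exact


# Counts taken from the message: union estimate without the first day,
# and the exact count reported for the same period.
estimate_without_first_day = 27501
exact_count = 24890

err = relative_error(estimate_without_first_day, exact_count)
print(f"relative error = {err:.1%}")
```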

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Alexander Saydakov
Since you are mixing sketches built in different environments, have you ever tested that the input strings are hashed the same way? There is a chance that strings might be represented differently in Hive and Spark, and therefore the resulting sketches might be disjoint while you might believe that
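The concern raised here, that the same string may reach the hash function in different byte representations, can be illustrated without any sketch library. The following is a minimal sketch under stated assumptions: SHA-256 stands in for whatever hash the sketches actually use, and UTF-16 code units stand in for a Java `char[]`; neither is taken from the thread.

```python
import hashlib


def digest(data: bytes) -> str:
    """Hash raw bytes; SHA-256 is only a stand-in for the sketch's own hash."""
    return hashlib.sha256(data).hexdigest()[:16]


item = "user-42"  # illustrative item, not from the thread

utf8_bytes = item.encode("utf-8")       # one byte per ASCII character
utf16_bytes = item.encode("utf-16-le")  # two bytes per character, like char[] code units

print(len(utf8_bytes), digest(utf8_bytes))
print(len(utf16_bytes), digest(utf16_bytes))
```

Because the byte sequences differ, so do the hashes, and two sketches fed the "same" strings through different encodings would treat them as different items. A union of such sketches would then count each item roughly twice.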

Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, The sketches are string-fed. Some of the sketches are built using Spark and the Hive functions from the datasketches library, while others are built using a kafka streams job. It's quite likely the covered period contains some sketches built by Spark and some by the streaming job, but I

Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, I notice that the first two sketches are the result of union operations, while the remaining sketches are pure streaming sketches. Could you perform Jon's request again except excluding the first two sketches? Just to cover the bases, could you explain the types of the data items that

Re: HLL Union and lgK config

2020-08-14 Thread Jon Malkin
Thanks! We're investigating. We'll let you know if we have further questions. jon On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak wrote: > Hi Jon, > The first sketch is the one where I see the jump. The exact count without > the first sketch is 24765. > > The result for lgK=12 without the first

Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi Jon, The first sketch is the one where I see the jump. The exact count without the first sketch is 24765. The result for lgK=12 without the first sketch is 11% off, lgK=5 is within 2%. Thanks, Marko On Fri, 14 Aug 2020 at 00:24, Jon Malkin wrote: > Hi Marko, > > Could you please let us
