Re: [E] Re: HLL Union and lgK config

2020-09-14 Thread Marko Mušnjak
Hi, I just wanted to confirm that simply converting the strings to charArray worked fine - the sketches from the hive library merged with the kstreams sketches now produce correct results. Thanks again for the help! On Fri, 14 Aug 2020 at 22:51, Marko Mušnjak wrote: > Hi, > > It does seem the

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :) On Fri, Aug 14, 2020 at 4:06 PM leerho wrote: > The other option would be to deprecate the Hive SketchState update(...) > method and create a newUpdate(...) method that has strings

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
The other option would be to deprecate the Hive SketchState update(...) method and create a newUpdate(...) method that has strings encoded with UTF-8, and also to document the reason why. Any other ideas? On Fri, Aug 14, 2020 at 4:03 PM leerho wrote: > Yep! It turns out that there is already

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Yep! It turns out that there is already an issue on this that was reported 18 days ago. Changing this will be fraught with problems as other Hive users may have a history of sketches created with Strings encoded as char[]. I'm not

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, It does seem that the first two days (probably from Spark+Hive UDFs), merged by themselves, closely match the exact count of 11034. The other 12 days (built using Kafka Streams), taken together, also closely match the exact count for the period. That would mean we have our cause here. Now to

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, As I stated before, the first 2 sketches are the result of union operations, while the rest are not. I get the following: All 14 sketches: 34530. Without the first day: 27501; your count: 24890; error = 10.5%. This is already way off. It represents an error of nearly 7 standard
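
The 10.5% figure quoted above can be checked with a few lines of arithmetic. This is only a sketch of the calculation: the two numbers come from the message, while the variable names are made up for illustration.

```python
# Numbers taken from the message above; names are illustrative only.
estimate = 27501   # union estimate without the first day
exact = 24890      # exact count quoted in the message

# Relative error of the estimate with respect to the exact count.
relative_error = (estimate - exact) / exact

print(f"relative error: {relative_error:.1%}")
```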

Re: [E] Re: HLL Union and lgK config

2020-08-14 Thread Alexander Saydakov
Since you are mixing sketches built in different environments, have you ever tested that the input strings are hashed the same way? There is a chance that strings might be represented differently in Hive and Spark, and therefore the resulting sketches might be disjoint while you might believe that
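
The concern raised above can be illustrated without any sketch library. The following is a minimal sketch under stated assumptions: SHA-256 stands in for whatever hash the sketches actually use, plain Python sets stand in for the sketches, UTF-16 little-endian bytes stand in for a Java char[], and the item names are invented. It shows only that the same string, presented as UTF-8 bytes in one pipeline and as 2-byte code units in the other, hashes to different values, so a union of the two "sketches" counts every item twice.

```python
import hashlib


def digest(data: bytes) -> str:
    """Stand-in hash (an assumption; not the hash used by the real sketches)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical input: the same 1000 distinct strings seen by both pipelines.
items = [f"user-{i}" for i in range(1000)]

# Pipeline A hashes the UTF-8 bytes of each string.
as_utf8 = {digest(s.encode("utf-8")) for s in items}

# Pipeline B hashes 2-byte code units, approximating a Java char[].
as_chars = {digest(s.encode("utf-16-le")) for s in items}

# The byte sequences differ even for plain ASCII strings...
bytes_differ = "user-1".encode("utf-8") != "user-1".encode("utf-16-le")

# ...so the two sets of hashes share nothing, and the union double counts.
overlap = len(as_utf8 & as_chars)
union_size = len(as_utf8 | as_chars)

print(bytes_differ, overlap, union_size)
```

If both pipelines used the same representation, the overlap would be complete and the union would stay at 1000 items in this toy setup.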

Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi, The sketches are string-fed. Some of the sketches are built using Spark and the Hive functions from the datasketches library, while others are built using a Kafka Streams job. It's quite likely the covered period contains some sketches built by Spark and some by the streaming job, but I

Re: HLL Union and lgK config

2020-08-14 Thread leerho
Hi Marko, I notice that the first two sketches are the result of union operations, while the remaining sketches are pure streaming sketches. Could you perform Jon's request again except excluding the first two sketches? Just to cover the bases, could you explain the types of the data items that

Re: HLL Union and lgK config

2020-08-14 Thread Jon Malkin
Thanks! We're investigating. We'll let you know if we have further questions. jon On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak wrote: > Hi Jon, > The first sketch is the one where I see the jump. The exact count without > the first sketch is 24765. > > The result for lgK=12 without the first

Re: HLL Union and lgK config

2020-08-14 Thread Marko Mušnjak
Hi Jon, The first sketch is the one where I see the jump. The exact count without the first sketch is 24765. The result for lgK=12 without the first sketch is 11% off, lgK=5 is within 2%. Thanks, Marko On Fri, 14 Aug 2020 at 00:24, Jon Malkin wrote: > Hi Marko, > > Could you please let us

Re: HLL Union and lgK config

2020-08-13 Thread leerho
Marko, We are working to understand this problem. Thank you for sending us the actual sketches. That helps us a great deal! Cheers, Lee. On Thu, Aug 13, 2020 at 3:24 PM Jon Malkin wrote: > Hi Marko, > > Could you please let us know two more things: > 1) Which is the one particular sketch

Re: HLL Union and lgK config

2020-08-13 Thread Jon Malkin
Hi Marko, Could you please let us know two more things: 1) Which is the one particular sketch that causes the estimate to jump? 2) What is the exact unique count of the others without that sketch? It sort of seems like the first sketch, but it's hard to know for sure since we don't know the true

HLL Union and lgK config

2020-08-13 Thread Marko Mušnjak
Hi, Could someone help me understand a behavior I see when trying to union some HLL sketches? I have 14 HLL sketches, and I know the exact unique counts for each of them. All the individual sketches give estimates within 2% of the exact counts. When I try to create a union, using the default