I have placed a [DISCUSS] thread on our d...@datasketches.apache.org list if you wish to suggest some ideas! :)
On Fri, Aug 14, 2020 at 4:06 PM leerho <lee...@gmail.com> wrote:

> The other option would be to deprecate the Hive SketchState update(...)
> method and create a "newUpdate(...)" method that takes strings encoded with
> UTF-8. And also document the reason why. Any other ideas?
>
> On Fri, Aug 14, 2020 at 4:03 PM leerho <lee...@gmail.com> wrote:
>
>> Yep! It turns out that there is already an issue
>> <https://github.com/apache/incubator-datasketches-hive/issues/54> on
>> this that was reported 18 days ago. Changing this will be fraught with
>> problems, as other Hive users may have a history of sketches created with
>> strings encoded as char[]. I'm not sure I see an easy solution other than
>> documenting it and putting warnings everywhere.
>>
>> On Fri, Aug 14, 2020 at 1:51 PM Marko Mušnjak <marko.musn...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It does seem the first two days (probably from Spark+Hive UDFs), merged
>>> by themselves, closely match the exact count of 11034. The other 12 days
>>> (built using Kafka Streams), taken together, also closely match the exact
>>> count for the period.
>>>
>>> That would mean we have our cause here.
>>>
>>> Now to discover how strings are represented in Spark's input files and
>>> in Avro records in Kafka... I see that
>>> org.apache.datasketches.hive.hll.SketchState::update converts strings to a
>>> char array, while updating with a String in
>>> org.apache.datasketches.hll.BaseHllSketch::update first converts it to
>>> UTF-8 and hashes the resulting byte array. Maybe converting strings to
>>> char[] in the Kafka Streams app will be a good first step.
>>>
>>> I'll give that a try and report back.
>>>
>>> Thanks everyone for your help in finding the source of this!
>>>
>>> Kind regards,
>>> Marko
>>>
>>> On Fri, 14 Aug 2020 at 20:58, leerho <lee...@gmail.com> wrote:
>>>
>>>> Hi Marko,
>>>>
>>>> As I stated before, the first 2 sketches are the result of union
>>>> operations, while the rest are not.
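[Editor's note: the char[]-vs-UTF-8 mismatch described above can be seen with nothing but the JDK standard library: the byte sequences that reach the hash function differ even for pure-ASCII identifiers, because char[] is hashed as UTF-16 code units while a String is first converted to UTF-8. This is a minimal sketch of the idea; the helper names are illustrative, not part of the DataSketches API.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingMismatch {
    // Roughly what the update(String) path feeds to the hash: UTF-8 bytes.
    static byte[] asUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Roughly what an update via char[] feeds to the hash: UTF-16 code units.
    static byte[] asUtf16Units(String s) {
        char[] chars = s.toCharArray();
        ByteBuffer bb = ByteBuffer.allocate(chars.length * 2);
        for (char c : chars) bb.putChar(c);
        return bb.array();
    }

    public static void main(String[] args) {
        String id = "user-123";
        byte[] a = asUtf8(id);       // 8 bytes for this ASCII id
        byte[] b = asUtf16Units(id); // 16 bytes: two per char
        // Even for ASCII input the hash inputs differ, so the two update
        // paths yield disjoint hash sets for the same logical identifiers.
        System.out.println(a.length + " vs " + b.length
                + ", equal=" + Arrays.equals(a, b));
    }
}
```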
>>>> I get the following:
>>>>
>>>> All 14 sketches: 34530
>>>> Without the first day: 27501; your count 24890; error = 10.5%. This
>>>> is already way off. It represents an error of nearly 7 standard
>>>> deviations, which is huge!
>>>> Without the first and second day: 22919; your count 22989; error =
>>>> -0.3%. This is well within the error bounds.
>>>>
>>>> I get the same results with library versions 1.2.0 and 1.3.0, and we
>>>> get the same results with our C++ library. Also, the C++ library was
>>>> redesigned from the ground up. I think it is highly unlikely we would have
>>>> such a serious bug in all three versions without it being detected
>>>> elsewhere.
>>>>
>>>> I think Alex is on the right track. If you encode the same input IDs
>>>> differently in two different environments, they are essentially distinct
>>>> from each other, causing the unique count to go up.
>>>>
>>>> Please let us know what you find out.
>>>>
>>>> Cheers,
>>>>
>>>> Lee.
>>>>
>>>> On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
>>>> sayda...@verizonmedia.com> wrote:
>>>>
>>>>> Since you are mixing sketches built in different environments, have
>>>>> you ever tested that the input strings are hashed the same way? There is
>>>>> a chance that strings might be represented differently in Hive and Spark,
>>>>> and therefore the resulting sketches might be disjoint while you believe
>>>>> that they should represent overlapping sets. The crucial part of these
>>>>> sketches is the MurmurHash3 hash of the input. If the hashes are
>>>>> different, the sketches are not compatible; they will represent disjoint
>>>>> sets.
>>>>> I would suggest trying a simple test: build sketches from a few
>>>>> predefined strings like "a", "b" and "c" in both systems and see whether
>>>>> the union of those sketches grows.
>>>>>
>>>>> On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <marko.musn...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The sketches are string-fed.
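[Editor's note: Lee's "nearly 7 standard deviations" figure can be reproduced from the classic HLL error formula, where the relative standard error is roughly 1.04 / sqrt(2^lgK), so at the library's default lgK = 12 it is about 1.6%. A quick check, assuming that default and the classic 1.04 constant (the library's published bounds differ slightly):]

```java
public class SigmaCheck {
    // Classic HLL relative standard error: ~1.04 / sqrt(k), with k = 2^lgK.
    static double rse(int lgK) {
        return 1.04 / Math.sqrt(1 << lgK);
    }

    public static void main(String[] args) {
        double estimate = 27501.0; // union estimate without the first day
        double exact = 24890.0;    // exact unique count for the same period
        double relErr = (estimate - exact) / exact; // ~10.5%
        double sigmas = relErr / rse(12);           // ~6.5 standard deviations
        System.out.printf("relErr=%.3f, sigmas=%.1f%n", relErr, sigmas);
    }
}
```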
>>>>>>
>>>>>> Some of the sketches are built using Spark and the Hive functions
>>>>>> from the DataSketches library, while others are built using a Kafka
>>>>>> Streams job. It's quite likely the covered period contains some sketches
>>>>>> built by Spark and some by the streaming job, but I can't tell where the
>>>>>> exact cutoff was.
>>>>>> The Spark job is using
>>>>>> org.apache.datasketches.hive.hll.DataToSketchUDAF.
>>>>>> The streaming job is building the sketches through Union objects
>>>>>> (it receives a stream of sketches, makes unions out of individual pairs,
>>>>>> and forwards the result as a sketch).
>>>>>>
>>>>>> After some adjustments to the queries I'm running to get the exact
>>>>>> counts, to take care of local times, etc., these should be the correct
>>>>>> values with excluded days:
>>>>>> Without the first day: 24890
>>>>>> Without the first and second day: 22989
>>>>>>
>>>>>> Thanks,
>>>>>> Marko
>>>>>>
>>>>>> On Fri, 14 Aug 2020 at 17:08, leerho <lee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Marko,
>>>>>>> I notice that the first two sketches are the result of union
>>>>>>> operations, while the remaining sketches are pure streaming sketches.
>>>>>>> Could you perform Jon's request again, except excluding the first two
>>>>>>> sketches?
>>>>>>>
>>>>>>> Just to cover the bases, could you explain the types of the
>>>>>>> data items that are being fed to the sketches? Are your identifiers
>>>>>>> strings, longs, or what?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lee.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jon.mal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks! We're investigating. We'll let you know if we have further
>>>>>>>> questions.
>>>>>>>>
>>>>>>>> jon
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <
>>>>>>>> marko.musn...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jon,
>>>>>>>>> The first sketch is the one where I see the jump. The exact count
>>>>>>>>> without the first sketch is 24765.
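[Editor's note: the simple test Alex suggests can be simulated without the DataSketches dependency by modeling the two update paths: if two environments feed different byte sequences to the hash for the same logical IDs, the union sees twice the true cardinality. A stdlib-only sketch of the effect; a real test would of course use HllSketch and Union from the library instead of a HashSet:]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DisjointUnionDemo {
    // Stand-in for a sketch union: the set of distinct hash inputs observed.
    static int unionSize(List<String> ids) {
        Set<String> union = new HashSet<>();
        for (String id : ids) {
            // Environment 1: update(String) path hashes the UTF-8 bytes.
            union.add(Arrays.toString(id.getBytes(StandardCharsets.UTF_8)));
            // Environment 2: char[] path hashes the UTF-16 code units.
            char[] cs = id.toCharArray();
            ByteBuffer bb = ByteBuffer.allocate(cs.length * 2);
            for (char c : cs) bb.putChar(c);
            union.add(Arrays.toString(bb.array()));
        }
        return union.size();
    }

    public static void main(String[] args) {
        // Truly overlapping sets would give 3; the mismatch doubles it to 6.
        System.out.println("union size = " + unionSize(List.of("a", "b", "c")));
    }
}
```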
>>>>>>>>>
>>>>>>>>> The result for lgK=12 without the first sketch is 11% off; lgK=5
>>>>>>>>> is within 2%.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Marko
>>>>>>>>>
>>>>>>>>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jon.mal...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Marko,
>>>>>>>>>>
>>>>>>>>>> Could you please let us know two more things:
>>>>>>>>>> 1) Which is the one particular sketch that causes the estimate
>>>>>>>>>> to jump?
>>>>>>>>>> 2) What is the exact unique count of the others without that
>>>>>>>>>> sketch?
>>>>>>>>>>
>>>>>>>>>> It sort of seems like the first sketch, but it's hard to know
>>>>>>>>>> for sure since we don't know the true leave-one-out exact counts.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> jon
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <
>>>>>>>>>> marko.musn...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Could someone help me understand a behavior I see when trying
>>>>>>>>>>> to union some HLL sketches?
>>>>>>>>>>>
>>>>>>>>>>> I have 14 HLL sketches, and I know the exact unique counts for
>>>>>>>>>>> each of them. All the individual sketches give estimates within 2%
>>>>>>>>>>> of the exact counts.
>>>>>>>>>>>
>>>>>>>>>>> When I try to create a union, using the default lgMaxK parameter
>>>>>>>>>>> results in a total estimate that is way off (25% larger than the
>>>>>>>>>>> exact count).
>>>>>>>>>>>
>>>>>>>>>>> However, reducing the lgMaxK parameter in the union to 4 or 5
>>>>>>>>>>> gives results that are within 2.5% of the exact counts.
>>>>>>>>>>>
>>>>>>>>>>> Also, one particular sketch seems to cause the final estimate
>>>>>>>>>>> to jump: not adding that sketch to the union keeps the result close
>>>>>>>>>>> to the exact count.
>>>>>>>>>>>
>>>>>>>>>>> Am I just seeing a very bad random error, or is there anything
>>>>>>>>>>> I'm doing wrong with the unions?
>>>>>>>>>>>
>>>>>>>>>>> Running on Java, using version 1.3.0.
>>>>>>>>>>> Just in case, the sketches are in the linked gist (hex encoded,
>>>>>>>>>>> one per line):
>>>>>>>>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>>>>>>>>> and the exact counts:
>>>>>>>>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>> Marko Musnjak
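[Editor's note on the lgMaxK observations in the original question: the nominal relative standard error of an HLL sketch grows as lgK shrinks, roughly as 1.04 / sqrt(2^lgK), so shrinking lgMaxK does not make the union more accurate; it widens the error bounds enough that even an inflated result can land near the exact count by chance. A rough illustration, assuming the classic 1.04 HLL constant (the library's published bounds differ slightly):]

```java
public class LgKError {
    // Classic HLL relative standard error: ~1.04 / sqrt(k), with k = 2^lgK.
    static double rse(int lgK) {
        return 1.04 / Math.sqrt(1 << lgK);
    }

    public static void main(String[] args) {
        for (int lgK : new int[] {4, 5, 12}) {
            System.out.printf("lgK=%2d  k=%4d  RSE=%5.1f%%%n",
                    lgK, 1 << lgK, 100 * rse(lgK));
        }
        // lgK=4 -> ~26.0% RSE, lgK=5 -> ~18.4%, lgK=12 -> ~1.6%.
        // At lgK=5, a +25% inflation from disjoint hash sets is only ~1.4
        // standard deviations, so the "accurate" small-lgK result is
        // consistent with noise rather than evidence of a fix.
    }
}
```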