I have started a [DISCUSS] thread on our d...@datasketches.apache.org list if
you wish to suggest some ideas! :)

On Fri, Aug 14, 2020 at 4:06 PM leerho <lee...@gmail.com> wrote:

> The other option would be to deprecate the Hive SketchState update(...)
> method and create a new "newUpdate(...)" method that encodes strings with
> UTF-8, and also document the reason why.  Any other ideas?
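>
> For illustration only (the real SketchState signature and surrounding code
> are omitted; the class and method shapes below are placeholders that just
> show the encoding change):
>
> import java.nio.charset.StandardCharsets;
> import org.apache.datasketches.hll.HllSketch;
>
> public class SketchStateIdea {
>   // Legacy behavior: hashes the char[] form of the string.
>   @Deprecated
>   public static void update(HllSketch sketch, String value) {
>     sketch.update(value.toCharArray());
>   }
>
>   // Proposed behavior: hashes the UTF-8 bytes, matching HllSketch.update(String).
>   public static void newUpdate(HllSketch sketch, String value) {
>     sketch.update(value.getBytes(StandardCharsets.UTF_8));
>   }
> }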
>
> On Fri, Aug 14, 2020 at 4:03 PM leerho <lee...@gmail.com> wrote:
>
>> Yep!  It turns out that there is already an issue
>> <https://github.com/apache/incubator-datasketches-hive/issues/54> on
>> this that was reported 18 days ago. Changing this will be fraught with
>> problems, as other Hive users may have a history of sketches created from
>> Strings encoded as char[].  I'm not sure I see an easy solution other than
>> documenting it and putting warnings everywhere.
>>
>> On Fri, Aug 14, 2020 at 1:51 PM Marko Mušnjak <marko.musn...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It does seem that the first two days (probably from the Spark+Hive UDFs),
>>> merged by themselves, closely match the exact count of 11034. The other 12
>>> days (built using Kafka Streams), taken together, also closely match the
>>> exact count for the period.
>>>
>>> That would mean we have our cause here.
>>>
>>> Now to discover how strings are represented in Spark's input files and
>>> in the Avro records in Kafka... I see that
>>> org.apache.datasketches.hive.hll.SketchState::update converts strings to a
>>> char array, while updating with a String via
>>> org.apache.datasketches.hll.BaseHllSketch::update first converts it to
>>> UTF-8 and hashes the resulting byte array. Maybe converting strings in the
>>> Kafka Streams app to char[] will be a good first step.
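>>>
>>> To make the difference concrete, here is a minimal, self-contained check of
>>> the two update paths (illustrative only; it assumes datasketches-java is on
>>> the classpath, and the IDs and lgK value are made up):
>>>
>>> import java.nio.charset.StandardCharsets;
>>> import org.apache.datasketches.hll.HllSketch;
>>> import org.apache.datasketches.hll.Union;
>>>
>>> public class EncodingCheck {
>>>   public static void main(String[] args) {
>>>     HllSketch viaString = new HllSketch(12); // update(String): hashes UTF-8 bytes
>>>     HllSketch viaChars = new HllSketch(12);  // update(char[]): hashes the char array
>>>     HllSketch viaUtf8 = new HllSketch(12);   // update(byte[]): hashes the given bytes
>>>
>>>     for (String id : new String[] {"id-1", "id-2", "id-3"}) {
>>>       viaString.update(id);
>>>       viaChars.update(id.toCharArray());
>>>       viaUtf8.update(id.getBytes(StandardCharsets.UTF_8));
>>>     }
>>>
>>>     Union mixed = new Union(12);
>>>     mixed.update(viaString);
>>>     mixed.update(viaChars);
>>>     // Expect ~6: the char[] hashes differ, so the sets look disjoint.
>>>     System.out.println("String + char[]: " + mixed.getResult().getEstimate());
>>>
>>>     Union matching = new Union(12);
>>>     matching.update(viaString);
>>>     matching.update(viaUtf8);
>>>     // Expect ~3: update(String) and update(UTF-8 bytes) hash identically.
>>>     System.out.println("String + UTF-8: " + matching.getResult().getEstimate());
>>>   }
>>> }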
>>>
>>> I'll give that a try and report back.
>>>
>>> Thanks everyone for your help in finding the source of this!
>>>
>>> Kind regards,
>>> Marko
>>>
>>> On Fri, 14 Aug 2020 at 20:58, leerho <lee...@gmail.com> wrote:
>>>
>>>> Hi Marko,
>>>>
>>>> As I stated before, the first 2 sketches are the result of union
>>>> operations, while the rest are not.  I get the following:
>>>>
>>>> All 14 sketches: 34530
>>>> Without the first day: 27501; your count 24890; error = 10.5%.  This is
>>>> already way off: it represents an error of nearly 7 standard deviations,
>>>> which is huge!
>>>> Without the first and second day: 22919; your count 22989; error = -0.3%.
>>>> This is well within the error bounds.
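>>>>
>>>> (Aside, for reference: assuming the classical HLL relative standard error
>>>> of roughly 1.04/sqrt(k), the default lgK = 12 gives k = 4096 and one
>>>> standard deviation of about 1.04/64 ≈ 1.6%, so a 10.5% error is on the
>>>> order of 6 to 7 standard deviations. The exact coefficient depends on the
>>>> estimator, but the conclusion is the same.)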
>>>>
>>>> I get the same results with library versions 1.2.0 and 1.3.0, and we get
>>>> the same results with our C++ library, which was redesigned from the
>>>> ground up.  I think it is highly unlikely that we would have such a
>>>> serious bug in all three versions without it being detected elsewhere.
>>>>
>>>> I think Alex is on the right track.  If you encode the same input IDs
>>>> differently in two different environments, they are essentially distinct
>>>> from each other, causing the unique count to go up.
>>>>
>>>> Please let us know what you find out.
>>>>
>>>> Cheers,
>>>>
>>>> Lee.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 14, 2020 at 9:45 AM Alexander Saydakov <
>>>> sayda...@verizonmedia.com> wrote:
>>>>
>>>>> Since you are mixing sketches built in different environments, have
>>>>> you ever tested that the input strings are hashed the same way? There is a
>>>>> chance that strings might be represented differently in Hive and Spark, and
>>>>> therefore the resulting sketches might be disjoint even though you believe
>>>>> they should represent overlapping sets. The crucial part of these
>>>>> sketches is the MurMur3 hash of the input. If the hashes are different,
>>>>> the sketches are not compatible; they will represent disjoint sets.
>>>>> I would suggest trying a simple test: build sketches from a few
>>>>> predefined strings like "a", "b", and "c" in both systems and see whether
>>>>> the union of those sketches grows beyond the expected count of 3.
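>>>>>
>>>>> Once you have the two serialized test sketches, the check could look
>>>>> roughly like this (a sketch only; the class, method, and variable names
>>>>> and the lgMaxK value are placeholders):
>>>>>
>>>>> import org.apache.datasketches.hll.HllSketch;
>>>>> import org.apache.datasketches.hll.Union;
>>>>>
>>>>> public class CompatibilityCheck {
>>>>>   // hiveBytes / streamsBytes: serialized sketches built from "a", "b", "c"
>>>>>   // in each environment.
>>>>>   public static void check(byte[] hiveBytes, byte[] streamsBytes) {
>>>>>     HllSketch fromHive = HllSketch.heapify(hiveBytes);
>>>>>     HllSketch fromStreams = HllSketch.heapify(streamsBytes);
>>>>>
>>>>>     Union union = new Union(12);
>>>>>     union.update(fromHive);
>>>>>     union.update(fromStreams);
>>>>>
>>>>>     // Compatible hashing: the estimate stays near 3.
>>>>>     // Incompatible hashing: the union roughly doubles, to about 6.
>>>>>     System.out.println("union estimate: " + union.getResult().getEstimate());
>>>>>   }
>>>>> }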
>>>>>
>>>>> On Fri, Aug 14, 2020 at 9:13 AM Marko Mušnjak <marko.musn...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The sketches are string-fed.
>>>>>>
>>>>>> Some of the sketches are built using Spark and the Hive functions
>>>>>> from the DataSketches library, while others are built using a Kafka
>>>>>> Streams job. It's quite likely the covered period contains some sketches
>>>>>> built by Spark and some by the streaming job, but I can't tell where the
>>>>>> exact cutoff was.
>>>>>> The Spark job is using
>>>>>> org.apache.datasketches.hive.hll.DataToSketchUDAF.
>>>>>> The streaming job builds the sketches through Union objects (it receives
>>>>>> a stream of sketches, unions individual pairs, and forwards each result
>>>>>> as a sketch).
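>>>>>>
>>>>>> (For clarity, the pairwise merge step looks roughly like the following;
>>>>>> this is only an illustration of the Union usage, not the actual topology
>>>>>> code, and the lgMaxK value is assumed.)
>>>>>>
>>>>>> import org.apache.datasketches.hll.HllSketch;
>>>>>> import org.apache.datasketches.hll.Union;
>>>>>>
>>>>>> public class PairwiseMerge {
>>>>>>   // Union two incoming sketches and forward the result downstream
>>>>>>   // as a plain HllSketch.
>>>>>>   public static HllSketch mergePair(HllSketch left, HllSketch right) {
>>>>>>     Union union = new Union(12);  // lgMaxK assumed, not the real value
>>>>>>     union.update(left);
>>>>>>     union.update(right);
>>>>>>     return union.getResult();
>>>>>>   }
>>>>>> }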
>>>>>>
>>>>>> After some adjustments to the queries I'm running to get the exact
>>>>>> counts (to take care of local times, etc.), these should be the correct
>>>>>> values with days excluded:
>>>>>> Without first day: 24890
>>>>>> Without first and second day: 22989
>>>>>>
>>>>>> Thanks,
>>>>>> Marko
>>>>>>
>>>>>>
>>>>>> On Fri, 14 Aug 2020 at 17:08, leerho <lee...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Marko,
>>>>>>> I notice that the first two sketches are the result of union
>>>>>>> operations, while the remaining sketches are pure streaming sketches.
>>>>>>> Could you perform Jon's request again, but excluding the first two
>>>>>>> sketches?
>>>>>>>
>>>>>>> Just to cover the bases, could you explain the types of the
>>>>>>> data items that are being fed to the sketches?  Are your identifiers
>>>>>>> strings, longs or what?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lee.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2020 at 11:57 PM Jon Malkin <jon.mal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks! We're investigating. We'll let you know if we have further
>>>>>>>> questions.
>>>>>>>>
>>>>>>>>   jon
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020, 11:40 PM Marko Mušnjak <
>>>>>>>> marko.musn...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jon,
>>>>>>>>> The first sketch is the one where I see the jump. The exact count
>>>>>>>>> without the first sketch is 24765.
>>>>>>>>>
>>>>>>>>> The result for lgK=12 without the first sketch is 11% off, while
>>>>>>>>> lgK=5 is within 2%.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Marko
>>>>>>>>>
>>>>>>>>> On Fri, 14 Aug 2020 at 00:24, Jon Malkin <jon.mal...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Marko,
>>>>>>>>>>
>>>>>>>>>> Could you please let us know two more things:
>>>>>>>>>> 1) Which is the one particular sketch that causes the estimate to
>>>>>>>>>> jump?
>>>>>>>>>> 2) What is the exact unique count of the others without that
>>>>>>>>>> sketch?
>>>>>>>>>>
>>>>>>>>>> It sort of seems like the first sketch, but it's hard to know for
>>>>>>>>>> sure since we don't know the true leave-one-out exact counts.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>   jon
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2020 at 8:41 AM Marko Mušnjak <
>>>>>>>>>> marko.musn...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Could someone help me understand a behavior I see when trying to
>>>>>>>>>>> union some HLL sketches?
>>>>>>>>>>>
>>>>>>>>>>> I have 14 HLL sketches, and I know the exact unique counts for
>>>>>>>>>>> each of them. All the individual sketches give estimates within 2% 
>>>>>>>>>>> of the
>>>>>>>>>>> exact counts.
>>>>>>>>>>>
>>>>>>>>>>> When I try to create a union, using the default lgMaxK parameter
>>>>>>>>>>> results in a total estimate that is way off (25% larger than the
>>>>>>>>>>> exact count).
>>>>>>>>>>>
>>>>>>>>>>> However, reducing the lgMaxK parameter in the union to 4 or 5
>>>>>>>>>>> gives results that are within 2.5% of the exact counts.
>>>>>>>>>>>
>>>>>>>>>>> Also, one particular sketch seems to cause the final estimate to
>>>>>>>>>>> jump - not adding that sketch to the union keeps the result close 
>>>>>>>>>>> to the
>>>>>>>>>>> exact count.
>>>>>>>>>>>
>>>>>>>>>>> Am I just seeing a very bad random error, or is there anything
>>>>>>>>>>> I'm doing wrong with the unions?
>>>>>>>>>>>
>>>>>>>>>>> Running on Java, using version 1.3.0. Just in case, the sketches
>>>>>>>>>>> are in the linked gist (hex encoded, one per line):
>>>>>>>>>>> https://gist.github.com/mmusnjak/c00a72b3dfbc52e780c2980acfd98351
>>>>>>>>>>> and the exact counts:
>>>>>>>>>>> https://gist.github.com/mmusnjak/dcbff67101be6cfc28ba01e63e41f73c
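>>>>>>>>>>>
>>>>>>>>>>> For reference, the union is formed roughly like this, assuming the
>>>>>>>>>>> hex lines have already been decoded into byte arrays (decoding and
>>>>>>>>>>> error handling omitted; class and method names are placeholders):
>>>>>>>>>>>
>>>>>>>>>>> import java.util.List;
>>>>>>>>>>> import org.apache.datasketches.hll.HllSketch;
>>>>>>>>>>> import org.apache.datasketches.hll.Union;
>>>>>>>>>>>
>>>>>>>>>>> public class UnionRepro {
>>>>>>>>>>>   // sketchBytes: the 14 decoded sketches, one byte[] per gist line.
>>>>>>>>>>>   public static double unionEstimate(List<byte[]> sketchBytes, int lgMaxK) {
>>>>>>>>>>>     Union union = new Union(lgMaxK);  // default lgMaxK is 12
>>>>>>>>>>>     for (byte[] bytes : sketchBytes) {
>>>>>>>>>>>       union.update(HllSketch.heapify(bytes));
>>>>>>>>>>>     }
>>>>>>>>>>>     return union.getResult().getEstimate();
>>>>>>>>>>>   }
>>>>>>>>>>> }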
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>> Marko Musnjak
>>>>>>>>>>>
>>>>>>>>>>>
