Re: [E] Theta Serialize/Deserialize and then update?

Karl Matthias Sat, 28 Aug 2021 06:31:39 -0700

Hi Lee,

Thanks very much for this. I had missed that the Union supports updates. I
had thought I needed to get the result from it first, but that also returns
a CompactSketch, which your reasoning explains well. Really appreciate both
of you guys helping me out.


Cheers,
Karl

On Fri, Aug 27, 2021 at 1:04 AM leerho <lee...@gmail.com> wrote:

> Hi Karl,
>   I just want to explain the reasons you cannot create an UpdateSketch
> directly from a CompactSketch:
>
> The CompactSketch is by definition immutable and has the smallest
> footprint and simplest structure.  It is produced as the result of all of
> the set operations because the set operations enable "merging" of sketches
> with different values of "K".  Thus the CompactSketch has no concept of
> "K".  It is just a list of hashes and a value of Theta. You can perform all
> the operations with a CompactSketch that you can with an UpdateSketch,
> except updating it with more input data.  Merging CompactSketches is faster
> than merging UpdateSketches because of the simpler structure, and, if you
> specify "ordered" (the default) when retrieving your CompactSketch, merging
> becomes extremely fast.
>
> Note that the theta Union provides a toByteArray(), union(Memory) as well
> as update(raw datums) operations. So you can always use the Union operator
> instead of the UpdateSketch for all updating and merging operations.  If
> you need to serialize (e.g., for transport or storage), you can:
>
>    - byteArray = union.toByteArray()
>    - <transport>
>    - mem = Memory.wrap(byteArray)
>    - union2 = //create new Union with SetOperationBuilder...
>    - union2.union(mem)
>    - //now you can continue to update(datums) with the union2, and/or
>      perform more union operations.
>
> Lee.
>
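
The serialize, transport, and continue-updating steps Lee lists can be tried out with a small toy model. This is a minimal sketch under stated assumptions, not the DataSketches implementation: the class `ToyUnion`, its multiplicative `hash`, and its byte layout (theta followed by the retained hashes) are all hypothetical and use only the JDK. It mirrors the flow of toByteArray(), a fresh union, union(bytes), and further update() calls.

```java
import java.nio.ByteBuffer;
import java.util.TreeSet;

/** Toy stand-in for a theta Union; illustration only, not the DataSketches code. */
final class ToyUnion {
    private final int k;                                    // nominal number of retained hashes
    private double theta = 1.0;                             // sampling threshold in (0, 1]
    private final TreeSet<Double> hashes = new TreeSet<>(); // retained hashes, all below theta

    ToyUnion(int k) { this.k = k; }

    /** Toy hash (an assumption of this example): multiplicative hashing mapped to [0, 1). */
    static double hash(long item) {
        return ((item * 0x9E3779B97F4A7C15L) >>> 11) * 0x1.0p-53;
    }

    /** Feed one raw item, playing the role of update(datum). */
    void update(long item) { insert(hash(item)); }

    /** Merge a serialized image: theta first, then the retained hashes. */
    void union(byte[] image) {
        ByteBuffer buf = ByteBuffer.wrap(image);
        double otherTheta = buf.getDouble();
        if (otherTheta < theta) {
            theta = otherTheta;
            hashes.tailSet(theta, true).clear();            // drop hashes at or above the new theta
        }
        while (buf.hasRemaining()) { insert(buf.getDouble()); }
    }

    private void insert(double h) {
        if (h >= theta) { return; }                         // outside the sampled range
        hashes.add(h);                                      // the set ignores duplicates
        if (hashes.size() > k) { theta = hashes.pollLast(); } // largest hash becomes theta
    }

    /** Serialize as theta followed by every retained hash; K is not part of the image. */
    byte[] toByteArray() {
        ByteBuffer buf = ByteBuffer.allocate(8 * (1 + hashes.size()));
        buf.putDouble(theta);
        for (double h : hashes) { buf.putDouble(h); }
        return buf.array();
    }

    double getEstimate() { return hashes.size() / theta; }

    public static void main(String[] args) {
        ToyUnion union = new ToyUnion(64);
        for (long i = 0; i < 10; i++) { union.update(i); }
        byte[] byteArray = union.toByteArray();             // <transport> would happen here

        ToyUnion union2 = new ToyUnion(16);                 // a different K is fine
        union2.union(byteArray);
        for (long i = 5; i < 15; i++) { union2.update(i); } // keep updating after the restore
        System.out.println(union2.getEstimate());           // 15 distinct items seen in total
    }
}
```

Because the toy image holds only theta and the retained hashes, the receiving union can be built with a different nominal size than the sender, which is the same reason Lee gives for the CompactSketch having no concept of "K".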
> On Thu, Aug 26, 2021 at 10:39 AM Karl Matthias <k...@community.com> wrote:
>
>> Thanks for that. I figured out how to manage it in the Java lib. You need
>> to use a WritableMemory to wrap the byte array and then explicitly
>> instantiate an UpdateSketch with the WritableMemory. This is now working
>> and I'm doing some prototyping. Ideally I could use this from the C++
>> library as well, but I will work with the Java lib for now while
>> investigating.
>>
>> I will spend some time seeing if I can simplify a series model to do what
>> I want.
>>
>> On Thu, Aug 26, 2021 at 12:07 AM Alexander Saydakov <
>> sayda...@verizonmedia.com> wrote:
>>
>>> I believe that Java code still has the functionality to serialize and
>>> deserialize updatable Theta sketches. You point to a "wrap" operation,
>>> which is one of two ways to deserialize: heapify (instantiate an object on
>>> heap from a given chunk of bytes, involves copying data) and wrap (directly
>>> operate on a given chunk of bytes, often off-heap).
>>>
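
The difference between heapify and wrap described above can be shown with a small analogy. This sketch uses `java.nio.ByteBuffer` from the JDK rather than the DataSketches Memory classes, and the names `HeapifyVsWrap`, `heapify`, and `wrap` are hypothetical helpers for this example only. It shows that a copied image is independent of the source bytes, while a wrapped image operates on them in place.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Analogy only: copying versus viewing a chunk of bytes. */
final class HeapifyVsWrap {
    /** "Heapify" analogue: copy the bytes so the new object owns its own storage. */
    static ByteBuffer heapify(byte[] image) {
        return ByteBuffer.wrap(Arrays.copyOf(image, image.length));
    }

    /** "Wrap" analogue: a view that reads and writes the caller's bytes directly. */
    static ByteBuffer wrap(byte[] image) {
        return ByteBuffer.wrap(image);
    }

    public static void main(String[] args) {
        byte[] image = {1, 2, 3, 4};
        ByteBuffer copied = heapify(image);
        ByteBuffer view = wrap(image);

        image[0] = 9;                          // mutate the original bytes

        System.out.println(copied.get(0));     // prints 1: the copy is unaffected
        System.out.println(view.get(0));       // prints 9: the view shares the bytes
    }
}
```

The copy costs time and memory up front but is safe from later changes to the source; the view avoids the copy but is only valid while the underlying bytes are kept intact.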
>>> Perhaps you could explain your use case a little more? What would the
>>> life cycle of your sketches be? When would you serialize them? When
>>> deserialize? How many do you anticipate keeping overall? How many would you
>>> like to update? What is the reason for serializing? And so on.
>>>
>>> On Wed, Aug 25, 2021 at 2:26 PM Karl Matthias <k...@community.com>
>>> wrote:
>>>
>>>> Thank you, I will dig around the old source and see if I can find it.
>>>> AFAICT it was already removed from the Java implementation as well [1]. You
>>>> can serialize an UpdateSketch, but once deserialized it is read-only.
>>>>
>>>> I do deeply understand time series data (I was on the team that
>>>> designed the second generation metrics pipeline at New Relic) but the
>>>> problem I'm trying to solve is not nicely modeled as a time series. Of
>>>> course that is possible, but doing it that way will require much more data
>>>> and many more calculations than I want at reporting time. The reported data
>>>> will always be for all time. So modeling as a time series will require an
>>>> increasingly large number of sketches, and possibly thus also a periodic
>>>> roll-up/compaction phase. None of which is necessary if I can simply update
>>>> the same sketch—really a set of them representing various dimensions—until
>>>> I rebuild it/them from the source events on a periodic basis. It is also
>>>> too much cardinality across too many dimensions to use the sketches simply
>>>> as a roll-up tool for distinct counting on the original data.
>>>>
>>>> I was hoping a private fork wasn't necessary to do it, but I can
>>>> understand that you folks intentionally chose not to support it. I will
>>>> have a go at it and see what I can make work.
>>>>
>>>> Thanks for the replies!
>>>>
>>>> [1]
>>>> https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139
>>>>
>>>> On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <
>>>> sayda...@verizonmedia.com> wrote:
>>>>
>>>>> It is possible, and we used to have serialization and deserialization
>>>>> of updatable Theta sketches. At some point we decided that it is more
>>>>> confusing than useful and might encourage anti-patterns in big systems
>>>>> (such as deserialize-update-serialize sequences on every update). So we
>>>>> removed this functionality from the C++ code, but not from Java (yet).
>>>>> Again, I would suggest treating serialization as finalizing a sketch.
>>>>> If you want to update it, create a fresh one for this new time frame or
>>>>> whatever classifier makes sense (batch, session, transaction). Hopefully
>>>>> this new sketch can be kept for updating for a while (until some
>>>>> close-of-books for a period of time or until the whole batch is processed
>>>>> or something). Finalized sketches can be easily merged as needed. Say, you
>>>>> create a new sketch every minute and serialize the previous one. Later your
>>>>> report can show the last 60-minute rolling window, a calendar day, or
>>>>> something similar by aggregating the appropriate set of sketches.
>>>>>
>>>>>
>>>>> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <k...@community.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the reply. Yes I could do time series sketches, but what I
>>>>>> want actually is a summary representation of the current set, which I
>>>>>> update over time and eventually replace entirely. It's an evented system
>>>>>> and I want to use Theta sketches as a sort of summary. I can rebuild them
>>>>>> entirely at any time, but if maintained live they would be a fast
>>>>>> approximation that is combinable with other Theta sketches. Ideally I would
>>>>>> not have to keep them all in memory to do that and could serialize and
>>>>>> deserialize at will.
>>>>>>
>>>>>> It sounds like it's not currently implemented. But if I can manage
>>>>>> to write the code for it, is it possible?
>>>>>>
>>>>>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
>>>>>> sayda...@verizonmedia.com> wrote:
>>>>>>
>>>>>>> Is there a good reason to necessarily update the same sketch you
>>>>>>> decided to serialize?
>>>>>>> I would suggest considering that sketch finalized. Perhaps, in your
>>>>>>> system these sketches would represent different time periods or
>>>>>>> different categories or something like that. Later on you may want to
>>>>>>> merge (union) some of them to obtain an estimate for a longer time
>>>>>>> frame or a total across categories and so on.
>>>>>>>
>>>>>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <k...@community.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I am working with both the Java library and the C++ library and the
>>>>>>>> Theta sketch.
>>>>>>>>
>>>>>>>> What I would like to do is update a sketch, save it somewhere (e.g.,
>>>>>>>> disk), then reload it later and possibly update it then. The
>>>>>>>> CompactSketch doesn't support updates, and when an UpdateSketch is
>>>>>>>> serialized and loaded, it is read-only.
>>>>>>>>
>>>>>>>> From looking at the Java code it seems like it would be possible to
>>>>>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>>>>>> doesn't appear to be an existing method that does this. Am I missing
>>>>>>>> something that already does this? Or is it not possible?
>>>>>>>>
>>>>>>>> Many thanks
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
