Hi Lee,

Thanks very much for this. I had missed that the Union supported updates. I
had thought I needed to get the result from it first, but that also returns a
CompactSketch, which your reasoning explains well. I really appreciate both of
you guys helping me out.

Cheers,
Karl

On Fri, Aug 27, 2021 at 1:04 AM leerho <lee...@gmail.com> wrote:

> Hi Karl,
> I just want to explain the reasons you cannot create an UpdateSketch
> directly from a CompactSketch:
>
> The CompactSketch is by definition immutable and has the smallest
> footprint and simplest structure. It is produced as the result of all of
> the set operations because the set operations enable "merging" of sketches
> with different values of "K". Thus the CompactSketch has no concept of
> "K". It is just a list of hashes and a value of Theta. You can perform all
> the operations with a CompactSketch that you can with an UpdateSketch,
> except updating it with more input data. Merging CompactSketches is faster
> than merging UpdateSketches because of the simpler structure, and, if you
> specify "ordered" (the default) when retrieving your CompactSketch, merging
> becomes extremely fast.
>
> Note that the theta Union provides toByteArray(), union(Memory), and
> update(raw datums) operations. So you can always use the Union operator
> instead of the UpdateSketch for all updating and merging operations. If
> you need to serialize (e.g., for transport or storage), you can
>
> - byteArray = union.toByteArray()
> - <transport>
> - mem = Memory.wrap(byteArray)
> - union2 = //create new Union with SetOperationBuilder...
> - union2.union(mem)
> - //now you can continue to update(datums) with the union2, and/or
>   perform more union operations.
>
> Lee.
>
> On Thu, Aug 26, 2021 at 10:39 AM Karl Matthias <k...@community.com> wrote:
>
>> Thanks for that. I figured out how to manage it in the Java lib. You need
>> to use a WritableMemory to wrap the byte array and then explicitly
>> instantiate an UpdateSketch with the WritableMemory. This is now working
>> and I'm doing some prototyping. Ideally I could use this from the C++
>> library as well, but I will work with the Java lib for now while
>> investigating.
>>
>> I will spend some time seeing if I can simplify a series model to do what
>> I want.
>>
>> On Thu, Aug 26, 2021 at 12:07 AM Alexander Saydakov <sayda...@verizonmedia.com> wrote:
>>
>>> I believe that the Java code still has the functionality to serialize and
>>> deserialize updatable Theta sketches. You point to a "wrap" operation,
>>> which is one of two ways to deserialize: heapify (instantiate an object
>>> on the heap from a given chunk of bytes, which involves copying data) and
>>> wrap (operate directly on a given chunk of bytes, often off-heap).
>>>
>>> Perhaps you could explain your use case a little more? What would the
>>> life cycle of your sketches be? When would you serialize them? When
>>> deserialize? How many do you anticipate keeping overall? How many would
>>> you like to update? What is the reason for serializing? And so on.
>>>
>>> On Wed, Aug 25, 2021 at 2:26 PM Karl Matthias <k...@community.com> wrote:
>>>
>>>> Thank you, I will dig around the old source and see if I can find it.
>>>> AFAICT it was already removed from the Java implementation as well [1].
>>>> You can serialize an UpdateSketch, but when deserialized it is
>>>> read-only.
>>>>
>>>> I do deeply understand time series data (I was on the team that
>>>> designed the second generation metrics pipeline at New Relic) but the
>>>> problem I'm trying to solve is not nicely modeled as a time series. Of
>>>> course that is possible, but doing it that way will require much more
>>>> data and many more calculations than I want at reporting time. The
>>>> reported data will always be for all time. So modeling as a time series
>>>> will require an increasingly large number of sketches, and possibly thus
>>>> also a periodic roll-up/compaction phase. None of which is necessary if
>>>> I can simply update the same sketch (really a set of them representing
>>>> various dimensions) until I rebuild it/them from the source events on a
>>>> periodic basis.
>>>> It is also too much cardinality across too many dimensions to use the
>>>> sketches simply as a roll-up tool for distinct counting on the original
>>>> data.
>>>>
>>>> I was hoping a private fork wasn't necessary to do it, but I can
>>>> understand that you folks intentionally chose not to support it. I will
>>>> have a go at it and see what I can make work.
>>>>
>>>> Thanks for the replies!
>>>>
>>>> [1]
>>>> https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139
>>>>
>>>> On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <sayda...@verizonmedia.com> wrote:
>>>>
>>>>> It is possible, and we used to have serialization and deserialization
>>>>> of updatable Theta sketches. At some point we decided that it was more
>>>>> confusing than useful and might encourage anti-patterns in big systems
>>>>> (such as deserialize-update-serialize sequences on every update). So we
>>>>> removed this functionality from the C++ code, but not from Java (yet).
>>>>> Again, I would suggest treating serialization as finalizing a sketch.
>>>>> If you want to update it, create a fresh one for this new time frame or
>>>>> whatever classifier makes sense (batch, session, transaction). Hopefully
>>>>> this new sketch can be kept for updating for a while (until some
>>>>> close-of-books for a period of time or until the whole batch is
>>>>> processed or something). Finalized sketches can be easily merged as
>>>>> needed. Say, you create a new sketch every minute and serialize the
>>>>> previous one.
>>>>> Later you can have your report show the last 60-min rolling window or
>>>>> a calendar day or something like that by aggregating the appropriate
>>>>> set of sketches for that report.
>>>>>
>>>>> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <k...@community.com> wrote:
>>>>>
>>>>>> Thanks for the reply. Yes I could do time series sketches, but what I
>>>>>> actually want is a summary representation of the current set, which I
>>>>>> update over time and eventually replace entirely. It's an evented
>>>>>> system and I want to use Theta sketches as a sort of summary. I can
>>>>>> rebuild them entirely at any time, but if maintained live they would be
>>>>>> a fast approximation that is combinable with other Theta sketches.
>>>>>> Ideally I would not have to keep them all in memory to do that and
>>>>>> could serialize and deserialize at will.
>>>>>>
>>>>>> It sounds like it's not currently implemented. But if I can manage
>>>>>> the code to do it, it is possible?
>>>>>>
>>>>>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <sayda...@verizonmedia.com> wrote:
>>>>>>
>>>>>>> Is there a good reason to necessarily update the same sketch you
>>>>>>> decided to serialize?
>>>>>>> I would suggest considering that sketch finalized. Perhaps, in your
>>>>>>> system these sketches would represent different time periods or
>>>>>>> different categories or something like that. Later on you may want to
>>>>>>> merge (union) some of them to obtain an estimate for a longer time
>>>>>>> frame or a total across categories and so on.
>>>>>>>
>>>>>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <k...@community.com> wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I am working with both the Java library and the C++ library and the
>>>>>>>> Theta sketch.
>>>>>>>>
>>>>>>>> What I would like to do is update a sketch, save it somewhere (e.g.,
>>>>>>>> to disk), then reload it later and possibly update it then.
>>>>>>>> The CompactSketch doesn't support updates, and when an UpdateSketch
>>>>>>>> is serialized and loaded, it is read-only.
>>>>>>>>
>>>>>>>> From looking at the Java code it seems like it would be possible to
>>>>>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>>>>>> doesn't appear to be an existing method that does this. Am I missing
>>>>>>>> something that already does this? Or is it not possible?
>>>>>>>>
>>>>>>>> Many thanks
>>>>>>>> Karl
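
Lee's explanation above, that a CompactSketch is "just a list of hashes and a value of Theta" and that set operations can merge sketches with different values of "K", can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions and is not the DataSketches implementation or API: the class `ToyTheta`, its `hash` function (a SplitMix64-style mixer), and the simple KMV-style trimming rule are all hypothetical and chosen for illustration. It shows that a union reads only each input's retained hashes and theta, never the input's K, while updating with raw data does need a K to decide when to trim.

```java
import java.util.TreeSet;

// Toy model for illustration only: NOT the DataSketches implementation or API.
final class ToyTheta {
    private final int k;                 // nominal entries; used only when trimming
    private double theta = 1.0;          // sampling threshold in (0, 1]
    private final TreeSet<Double> hashes = new TreeSet<>(); // retained hashes, all < theta

    ToyTheta(int k) {
        this.k = k;
    }

    // Hypothetical hash: SplitMix64-style mixing, mapped to a double in [0, 1).
    static double hash(long item) {
        long x = item + 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        x ^= x >>> 31;
        return (x >>> 11) * 0x1.0p-53;
    }

    // Updating with raw data needs K: it decides when the sketch must trim.
    void update(long item) {
        offer(hash(item));
    }

    private void offer(double h) {
        if (h >= theta) {
            return;                      // above the threshold: not sampled
        }
        hashes.add(h);                   // duplicates collapse in the set
        if (hashes.size() > k) {
            theta = hashes.pollLast();   // drop the largest hash; it becomes theta
        }
    }

    // Merging reads only the other sketch's hashes and theta, never its K.
    void union(ToyTheta other) {
        theta = Math.min(theta, other.theta);
        hashes.tailSet(theta, true).clear();   // discard hashes >= the new theta
        for (double h : other.hashes) {
            offer(h);
        }
    }

    double estimate() {
        return hashes.size() / theta;
    }

    public static void main(String[] args) {
        ToyTheta a = new ToyTheta(64);         // built with a small K
        ToyTheta b = new ToyTheta(256);        // built with a different K
        for (long i = 0; i < 10_000; i++) a.update(i);
        for (long i = 5_000; i < 15_000; i++) b.update(i);

        ToyTheta merged = new ToyTheta(256);
        merged.union(a);
        merged.union(b);
        System.out.printf("estimate of |A union B| (true value 15000): %.0f%n",
                merged.estimate());
    }
}
```

In this toy, the state a merge needs is exactly the pair (hashes, theta), which is why a compact, immutable form can drop K entirely. For real work, the Union operator Lee describes is the thing to use.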