Re: Proposal : An extension for sketch-based statistics

2017-08-16 Thread Arnaud Fournier
Thanks for bringing these subjects into the discussion, Ismaël. On the second point, about the standard deviation, I just want to add that this could also be added to the distribution metric. Actually, I think this makes much more sense than just adding a new transform for it (we can also do both).
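To make the standard deviation point concrete: a metric that is aggregated across workers needs an accumulator that can be merged, not just updated. The following is a minimal sketch of such an accumulator in plain Java, assuming a running count, mean and sum of squared deviations combined with the usual pairwise merge formula. The class name `StdDevAccumulator` and its methods are hypothetical and are not part of Beam or of the proposed extension.

```java
/** Hypothetical mergeable accumulator for a standard deviation; not Beam API. */
final class StdDevAccumulator {
    private long count;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Adds one observation using Welford's online update. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Folds another accumulator into this one (pairwise merge of partial results). */
    void merge(StdDevAccumulator other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    /** Population standard deviation; NaN when nothing has been added. */
    double populationStdDev() {
        return count == 0 ? Double.NaN : Math.sqrt(m2 / count);
    }
}
```

Two partial accumulators built on different workers can be merged in any order, which is the property both a metric and a combine-style transform would rely on.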

Re: Proposal : An extension for sketch-based statistics

2017-08-14 Thread Ismaël Mejía
Kenneth’s idea of using sketches for state with the State API is really interesting; it opens up some new use cases. I haven’t thought about it in depth yet, but I believe it is an appealing use case for the sketches. Note that the origin of this work was in the line of statistics, in

Re: Proposal : An extension for sketch-based statistics

2017-08-12 Thread Arnaud Fournier
Hello Kenneth, thank you for your answer. I read your blog post about stateful processing, and that is indeed a great feature! So, if I understood correctly, we could use the combineFns to declare combiningStates that can then be used while processing elements in a DoFn. That opens up a lot more use

Re: Proposal : An extension for sketch-based statistics

2017-08-07 Thread Kenneth Knowles
This is a great development! I have wanted Beam to have a library of sketches. What Eugene is referring to is the fact that you can write Combine.perKey(combineFn) to use these in a transform but also StateSpecs.combiningState(combineFn) to use them in a stateful ParDo. So it is good to make the
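The point about one combine function serving two roles can be sketched in plain Java. This is an illustration only, under stated assumptions: `SketchFn`, `DistinctCountFn`, `SketchUsage` and `CombiningState` are hypothetical stand-ins and are not Beam classes, and the exact `HashSet` accumulator is a placeholder for a real sketch such as HyperLogLog. In Beam itself the two roles correspond to the `Combine.perKey(combineFn)` and `StateSpecs.combiningState(combineFn)` calls named in the message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the shape of a combine function; not Beam's CombineFn. */
interface SketchFn<InputT, AccumT, OutputT> {
    AccumT createAccumulator();
    AccumT addInput(AccumT acc, InputT input);
    AccumT mergeAccumulators(List<AccumT> accs);
    OutputT extractOutput(AccumT acc);
}

/** Exact distinct count; a real sketch would replace the set with a compact structure. */
final class DistinctCountFn implements SketchFn<String, Set<String>, Long> {
    public Set<String> createAccumulator() { return new HashSet<>(); }
    public Set<String> addInput(Set<String> acc, String input) { acc.add(input); return acc; }
    public Set<String> mergeAccumulators(List<Set<String>> accs) {
        Set<String> merged = new HashSet<>();
        for (Set<String> a : accs) { merged.addAll(a); }
        return merged;
    }
    public Long extractOutput(Set<String> acc) { return (long) acc.size(); }
}

final class SketchUsage {
    /** Role 1: combine all values of each key in one pass, as a grouped transform would. */
    static <I, A, O> Map<String, O> combinePerKey(
            SketchFn<I, A, O> fn, List<Map.Entry<String, I>> input) {
        Map<String, A> accs = new HashMap<>();
        for (Map.Entry<String, I> e : input) {
            A acc = accs.containsKey(e.getKey()) ? accs.get(e.getKey()) : fn.createAccumulator();
            accs.put(e.getKey(), fn.addInput(acc, e.getValue()));
        }
        Map<String, O> out = new HashMap<>();
        for (Map.Entry<String, A> e : accs.entrySet()) {
            out.put(e.getKey(), fn.extractOutput(e.getValue()));
        }
        return out;
    }

    /** Role 2: a state cell updated and read element by element, driven by the same function. */
    static final class CombiningState<I, A, O> {
        private final SketchFn<I, A, O> fn;
        private A acc;
        CombiningState(SketchFn<I, A, O> fn) {
            this.fn = fn;
            this.acc = fn.createAccumulator();
        }
        void add(I input) { acc = fn.addInput(acc, input); }
        O read() { return fn.extractOutput(acc); }
    }
}
```

Because both roles only touch the four accumulator operations, a sketch written once as a combine function can be reused in either setting without changes.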

Re: Proposal : An extension for sketch-based statistics

2017-08-04 Thread Arnaud Fournier
Thanks for your comments, that is very encouraging! I have created a Jira: https://issues.apache.org/jira/browse/BEAM-2728 and a PR: https://github.com/apache/beam/pull/3686. Eugene and Lucas, I saw that you already have some ideas, so I put you as reviewers. I look forward to hearing more from

Re: Proposal : An extension for sketch-based statistics

2017-08-03 Thread Anand Iyer
This is awesome!! Very exciting to see the addition of statistical and data-mining algorithms to Apache Beam.
On Thu, Aug 3, 2017 at 2:32 PM, Eugene Kirpichov <kirpic...@google.com.invalid> wrote:
> +1, Very exciting! I have some suggestions on the exact API to expose (e.g. I think it makes

Re: Proposal : An extension for sketch-based statistics

2017-08-03 Thread Lukasz Cwik
I'm most interested in the frequency / cardinality tools, as they could be used to improve performance automatically for combiners by detecting the few-keys case, or to handle hot keys automatically without users needing to specify hints when they use a combiner.
On Thu, Aug 3, 2017 at 5:35 AM,

Re: Proposal : An extension for sketch-based statistics

2017-08-03 Thread Jean-Baptiste Onofré
Nice work Arnaud ;) Happy to have been able to help. Let's see what the others will think about this.
Regards
JB
On 08/03/2017 02:32 PM, Arnaud Fournier wrote:
> Hello everyone, My name is Arnaud Fournier and I am a CS student. I am currently doing an internship at Talend. With the support

Proposal : An extension for sketch-based statistics

2017-08-03 Thread Arnaud Fournier
Hello everyone, My name is Arnaud Fournier and I am a CS student. I am currently doing an internship at Talend. With the support of Jean-Baptiste Onofre and Ismaël Mejia, I have been working on statistical analysis of streams with Beam, using probabilistic data structures like HyperLogLog. I