Thanks a lot, Joel, for your very fast and informative reply! We'll chew on
this and add a Jira if we're going on this route.

--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein <joels...@gmail.com> wrote:

> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, and then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet
>> been implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
>> stream. Pseudo code for a metric expression syntax is below:
>>
>> metrics(metrics(search()))
>>
>> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering, then
>> we'll also need to support merging the metrics. Something like this:
>>
>> metrics(parallel(metrics(join())))
>>
>> Here the metrics function wrapping the parallel function would need to
>> collect the EOF tuples from each worker, merge the metrics, and then emit
>> the merged metrics in an EOF Tuple.
>>
>> If you think this meets your needs, feel free to create a jira and begin
>> a patch, and I can help get it committed.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to rollup on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use-case, where we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch [1], which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'd potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
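[Editor's note: the approach Radu proposes, a count-min sketch fed by the tuples of a wrapped stream to get approximate top-N facet counts without sorting, can be sketched as below. This is an illustration only, not Solr code; the hash construction, table sizes, and function names are assumptions.]

```python
# Rough sketch of the proposed decorator's core: a count-min sketch plus a
# top-N selection over a tuple stream, with no sort on the facet field.
# Illustration only; hash choice and default sizes are assumptions.
import hashlib
import heapq

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent-ish hash per row, derived from md5 for brevity.
        for row in range(self.depth):
            h = hashlib.md5(("%d:%s" % (row, key)).encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, key):
        for row, col in self._cells(key):
            self.table[row][col] += 1

    def estimate(self, key):
        # Never under-estimates; the minimum across rows bounds the error.
        return min(self.table[row][col] for row, col in self._cells(key))


def top_n_facet(tuples, field, n):
    """One pass over the stream: update the sketch per facet value and keep
    estimated counts for the values seen. For brevity the candidate map is
    unbounded here; a real implementation would cap it (e.g. a min-heap of
    the current top N) to preserve the memory savings."""
    sketch = CountMinSketch()
    candidates = {}
    for tup in tuples:
        value = tup.get(field)
        if value is None:
            continue
        sketch.add(value)
        candidates[value] = sketch.estimate(value)
    return heapq.nlargest(n, candidates.items(), key=lambda kv: kv[1])


stream = [{"color": c} for c in ["red"] * 5 + ["blue"] * 3 + ["green"]]
print(top_n_facet(stream, "color", 2))
```

Because the sketch needs no sort order, several such facets can share one pass over the same joined stream, which is exactly the "multiple facets in one go" win described above.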