Thanks a lot, Joel, for your very fast and informative reply!

We'll chew on this and open a Jira if we decide to go this route.
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein <joels...@gmail.com> wrote:
> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet been
>> implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
>> stream. Pseudo code for a metric expression syntax is below:
>>
>> metrics(metrics(search()))
>>
>> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering then we'll
>> also need to support the merging of the metrics. Something like this:
>>
>> metrics(parallel(metrics(join())))
>>
>> Where the metrics wrapping the parallel function would need to collect the
>> EOF tuples from each worker, then merge the metrics and emit the merged
>> metrics in an EOF Tuple.
>>
>> If you think this meets your needs, feel free to create a Jira and
>> begin a patch and I can help get it committed.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to roll up on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use case, where we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch[1] which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'll potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
>>>
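Editorial sketch of the Count-min idea described in the message above. This is plain Python, not Solr code; the names `CountMinSketch` and `approximate_facets`, the default `width` and `depth`, and the dict-based tuples are all assumptions made for illustration. It shows how one unsorted pass over the joined tuples could produce approximate top-N counts for several fields at once.

```python
import hashlib


class CountMinSketch:
    """Fixed-size counter table; estimates can overcount but never undercount."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One salted hash per row, reduced to a column index.
        digest = hashlib.blake2b(
            value.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, value, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, value)] += count

    def estimate(self, value):
        # The smallest counter across rows is the least polluted by collisions.
        return min(
            self.table[row][self._index(row, value)] for row in range(self.depth)
        )


def approximate_facets(tuples, fields, top_n=3):
    """Single pass, no sort: keep a sketch and a small candidate set per field."""
    sketches = {f: CountMinSketch() for f in fields}
    candidates = {f: {} for f in fields}
    for tup in tuples:
        for f in fields:
            value = tup.get(f)
            if value is None:
                continue
            sketches[f].add(value)
            top = candidates[f]
            top[value] = sketches[f].estimate(value)
            if len(top) > top_n:
                # Evict the candidate with the lowest estimated count.
                del top[min(top, key=top.get)]
    return {
        f: sorted(candidates[f].items(), key=lambda kv: (-kv[1], kv[0]))
        for f in fields
    }


# Hypothetical joined tuples, faceted on two fields in one pass.
joined = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
    {"color": "red", "size": "S"},
    {"color": "green", "size": "L"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "L"},
]
facets = approximate_facets(joined, ["color", "size"], top_n=2)
```

Memory is bounded by `width * depth` counters per field plus `top_n` candidates, regardless of how many distinct values the stream carries. The trade-off is that counts are upper bounds, which matches the "not necessarily accurate" requirement stated above.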
>>
>>
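Editorial sketch of the MetricStream behaviour described in the thread: pass tuples through untouched, gather metrics in memory, and deliver them on the EOF tuple. This is a simplified Python model and not Solr's Java API. Tuples are plain dicts, the EOF tuple is assumed to be `{"EOF": True}`, and `search`, `metric_stream`, `merge_eof_metrics` and the `"metrics"` key are hypothetical names chosen for the example.

```python
def search(rows):
    """Stand-in for an underlying stream: emits rows, then an EOF tuple."""
    yield from rows
    yield {"EOF": True}


def metric_stream(source, field):
    """Emit every tuple unchanged; attach per-value counts of `field` to EOF."""
    counts = {}
    for tup in source:
        if tup.get("EOF"):
            eof = dict(tup)
            # Keep metrics added by any wrapped metric stream and add ours.
            metrics = dict(eof.get("metrics", {}))
            metrics[field] = counts
            eof["metrics"] = metrics
            yield eof
            return
        value = tup.get(field)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
        yield tup


def merge_eof_metrics(eof_tuples):
    """Combine the EOF tuples of several workers into one merged EOF tuple."""
    merged = {}
    for eof in eof_tuples:
        for field, counts in eof.get("metrics", {}).items():
            target = merged.setdefault(field, {})
            for value, n in counts.items():
                target[value] = target.get(value, 0) + n
    return {"EOF": True, "metrics": merged}


rows = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
    {"color": "red", "size": "S"},
]

# Nested form, analogous to metrics(metrics(search())).
out = list(metric_stream(metric_stream(search(rows), "color"), "size"))

# Parallel form: each worker sees a slice and the outer layer merges the EOFs.
worker_eofs = [
    list(metric_stream(search(part), "color"))[-1]
    for part in (rows[:2], rows[2:])
]
merged_eof = merge_eof_metrics(worker_eofs)
```

Because each layer only reads tuples and writes to the EOF tuple, any number of metric layers can be stacked over the same stream without re-running the join. Simple counts merge by addition; metrics that do not combine that way would need their own merge rule.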
