Re: Multiple rollups/facets in one streaming aggregation?
I need to improve the user experience of facet calculation. Assume we have time-partitioned collections Partition1, Partition2, Partition3, ....., and AliasAllPartitions unifies all the partitions together. Running facets on AliasAllPartitions is a very heavy synchronous operation, and the user has to wait a long time for the first result. My suggestion is to run partition after partition and return partial results at intermediate points. This can be relevant for any aggregate, faceting, or count-distinct function. Since I actually only need an estimate of the facets, I can use a Count-Min Sketch and HLL to keep memory consumption reasonable.

The interface could look like this:

CMSFacet(
  list(
    search(partition1, q=*:*, fl="author,name,price", qt="/export", sort="name asc"),
    search(partition2, q=*:*, fl="author,name,price", qt="/export", sort="name asc"),
    search(partition3, q=*:*, fl="author,name,price", qt="/export", sort="name asc"),
    search(partition4, q=*:*, fl="author,name,price", qt="/export", sort="name asc"),
    search(partition5, q=*:*, fl="author,name,price", qt="/export", sort="name asc")
  ),
  bucketSizeLimit=150,
  sizeLimit=400,
  sum(price), min(price), CMScount(name)
)

Expected output:

{
  "result-set": {
    "docs": [
      {
        "min(price)": "215464",
        "sum(price)": "23545846",
        "CMScount(name)": {"rows":149,"facet":[
          {"A Clash of Kings28":4},{"A Clash of Kings16":4},{"A Clash of Kings27":4},{"A Clash of Kings15":4},
          {"A Clash of Kings26":4},{"A Clash of Kings14":4},{"A Clash of Kings25":4},{"A Clash of Kings19":4},
          {"A Clash of Kings18":4},{"A Clash of Kings29":4},{"A Game of Thrones18":6},{"A Clash of Kings20":4},
          {"A Clash of Kings13":4},{"A Clash of Kings24":4},{"A Clash of Kings12":4},{"A Clash of Kings23":4},
          {"A Clash of Kings22":4},{"A Clash of Kings10":4},{"A Clash of Kings21":4},{"A Clash of Kings5":4}]}
      },
      {
        "min(price)": "655464",
        "sum(price)": "3584684646846",
        "CMScount(name)": {"rows":299,"facet":[
          {"A Storm of Swords18":8},{"A Game of Thrones18":8},{"A Game of Thrones28":7},{"A Game of Thrones27":7},
          {"A Game of Thrones24":5},{"A Game of Thrones3":11},{"A Game of Thrones4":10},{"A Game of Thrones6":8},
          {"A Storm of Swords20":7},{"A Game of Thrones8":6},{"A Game of Thrones9":7},{"A Storm of Swords11":8},
          {"A Storm of Swords22":8},{"A Storm of Swords21":10},{"A Storm of Swords13":8},{"A Storm of Swords24":8},
          {"A Storm of Swords23":13},{"A Storm of Swords15":7},{"A Storm of Swords26":8},{"A Storm of Swords27":7}]}
      },
      {
        "min(price)": -214.87158,
        "sum(price)": -40523.873622472,
        "CMScount(name)": {"rows":399,"facet":[
          {"A Storm of Swords18":12},{"A Game of Thrones18":12},{"A Game of Thrones28":11},{"A Game of Thrones27":11},
          {"A Game of Thrones24":15},{"A Game of Thrones3":11},{"A Game of Thrones4":10},{"A Game of Thrones6":12},
          {"A Storm of Swords20":7},{"A Game of Thrones8":6},{"A Game of Thrones9":7},{"A Storm of Swords11":12},
          {"A Storm of Swords22":8},{"A Storm of Swords21":10},{"A Storm of Swords13":12},{"A Storm of Swords24":12},
          {"A Storm of Swords23":13},{"A Storm of Swords15":7},{"A Storm of Swords26":12},{"A Storm of Swords27":11}]}
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 4381
      }
    ]
  }
}

I wrote a prototype of this functionality on top of Solr 7. I implemented class CMSFacetStream extends TupleStream implements Expressible and class CMSMetric extends Metric. My current issues:
- I return result tuples as soon as I reach bucketSizeLimit, but I don't see the partial results in the response.
- How can I return a JSON object from a Metric class?

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4348260.html
Sent from the Solr - User mailing list archive at Nabble.com.
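For what it's worth, a minimal sketch of the counting structure that something like CMScount(name) could be built on. This is a generic, illustrative Count-Min Sketch, not the prototype's actual CMSMetric code; the class and method names here are assumptions:

```java
import java.util.Random;

// Minimal Count-Min Sketch: fixed memory, estimates never undercount.
public class CountMinSketch {
    private final int depth;
    private final int width;
    private final long[][] table;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        // Fixed seed so sketches built on different partitions use the
        // same hash functions and can later be merged.
        Random r = new Random(42);
        for (int i = 0; i < depth; i++) {
            seeds[i] = r.nextInt();
        }
    }

    private int bucket(int row, String key) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16); // mix the bits a little
        return Math.floorMod(h, width);
    }

    // Record one occurrence of a facet value.
    public void add(String key) {
        for (int i = 0; i < depth; i++) {
            table[i][bucket(i, key)]++;
        }
    }

    // Estimated count: minimum over rows; may overcount on hash
    // collisions, never undercounts.
    public long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][bucket(i, key)]);
        }
        return min;
    }
}
```

Memory is fixed at depth x width counters regardless of how many distinct facet values stream through, which is what makes this attractive for the partition-after-partition approach.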
Re: Multiple rollups/facets in one streaming aggregation?
It is not possible to sort on multivalued fields out of the box. (Indeed, if you don't specify any logic, which value should be considered the reference for sorting?) Are you trying to sort, or does this happen by default?

Cheers

On Thu, Sep 1, 2016 at 10:18 AM, subramani.new <subramani@gmail.com> wrote:
> Hello,
>
> I am exploring solr and its new feature ( Parallel Sql Interface and
> Stream api ).
>
> I have tried most of the api's and works fine. But, I am facing issue with
> multivalue field.
>
> My Json input has multi value fields. I trying to aggregate those fields
> but I am unable to.
>
> Exception :
> can not sort on multivalued field
>
> My use case :
>
> input:
> {
>   id: 1
>   field1: [1,2,3],
>   app.name: [watsapp, facebook, ... ]
> }
> {
>   id: 2
>   field1: [1,2,3],
>   app.name: [watsapp, facebook, ... ]
> }
>
> Expected result :
> watsapp: 2
> facebook: 2
>
> I have 2 TB data. I wanted to execute in aggmode=map_reduce. Any
> suggestion?
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4294270.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
Re: Multiple rollups/facets in one streaming aggregation?
Hello,

I am exploring Solr and its new features (the Parallel SQL Interface and the streaming API).

I have tried most of the APIs and they work fine, but I am facing an issue with a multivalued field. My JSON input has multivalued fields, and I am trying to aggregate on those fields but am unable to.

Exception :
can not sort on multivalued field

My use case :

input:
{
  id: 1
  field1: [1,2,3],
  app.name: [watsapp, facebook, ... ]
}
{
  id: 2
  field1: [1,2,3],
  app.name: [watsapp, facebook, ... ]
}

Expected result :
watsapp: 2
facebook: 2

I have 2 TB of data and wanted to execute this in aggmode=map_reduce. Any suggestions?

--
View this message in context: http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4294270.html
Sent from the Solr - User mailing list archive at Nabble.com.
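As an aside: the expected result above (per-value document counts over a multivalued field) does not inherently require a sort at all; it can be computed in one pass over the tuples. A plain-Java sketch of that idea, with the field values taken from the example above and everything else illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class MultiValueCount {
    // For each value of a multivalued field, count how many documents
    // contain it. One pass, no sorting required.
    public static Map<String, Integer> countValues(List<List<String>> docsFieldValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> values : docsFieldValues) {
            // Dedupe within a document so each doc contributes at most 1.
            for (String v : new HashSet<>(values)) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The memory cost is one counter per distinct value, which is why a sketch-based variant becomes interesting at 2 TB scale.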
Re: Multiple rollups/facets in one streaming aggregation?
Thanks a lot, Joel, for your very fast and informative reply! We'll chew on this and add a Jira if we're going down this route.

--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Aug 16, 2016 at 8:29 PM, Joel Bernstein wrote:
> For the initial implementation we could skip the merge piece if that helps
> get things done faster. In this scenario the metrics could be gathered
> after some parallel operation, and then there would be no need for a merge.
> Sample syntax:
>
> metrics(parallel(join()))
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein wrote:
>
>> The concept of a MetricStream was in the early designs but hasn't yet been
>> implemented. Now might be a good time to work on the implementation.
>>
>> The MetricStream wraps a stream and gathers metrics in memory, continuing
>> to emit the tuples from the underlying stream. This allows multiple
>> MetricStreams to operate over the same stream without transforming the
>> stream. Pseudo-code for a metric expression syntax is below:
>>
>> metrics(metrics(search()))
>>
>> The MetricStream delivers its metrics through the EOF Tuple. So the
>> MetricStream simply adds the finished aggregations to the EOF Tuple and
>> returns it. If we're going to support parallel metric gathering then we'll
>> also need to support merging of the metrics. Something like this:
>>
>> metrics(parallel(metrics(join())))
>>
>> Where the metrics wrapping the parallel function would need to collect the
>> EOF tuples from each worker, merge the metrics, and then emit the merged
>> metrics in an EOF Tuple.
>>
>> If you think this meets your needs, feel free to create a jira and begin
>> a patch and I can help get it committed.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
>> radu.gheor...@sematext.com> wrote:
>>
>>> Hello Solr users :)
>>>
>>> Right now it seems that if I want to rollup on two different fields
>>> with streaming expressions, I would need to do two separate requests.
>>> This is too slow for our use case, when we need to do joins before
>>> sorting and rolling up (because we'd have to re-do the joins).
>>>
>>> Since in our case we are actually looking for some not-necessarily
>>> accurate facets (top N), the best solution we could come up with was
>>> to implement a new stream decorator that implements an algorithm like
>>> Count-min sketch [1] which would run on the tuples provided by the
>>> stream function it wraps. This would have two big wins for us:
>>> 1) it would do the facet without needing to sort on the facet field,
>>> so we'll potentially save lots of memory
>>> 2) because sorting isn't needed, we could do multiple facets in one go
>>>
>>> That said, I have two (broad) questions:
>>> A) is there a better way of doing this? Let's reduce the problem to
>>> streaming aggregations, where the assumption is that we have multiple
>>> collections where data needs to be joined, and then facet on fields
>>> from all collections. But maybe there's a better algorithm, something
>>> out of the box or closer to what is offered out of the box?
>>> B) whatever the best way is, could we do it in a way that can be
>>> contributed back to Solr? Any hints on how to do that? Just another
>>> decorator?
>>>
>>> Thanks and best regards,
>>> Radu
>>> --
>>> Performance Monitoring * Log Analytics * Search Analytics
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
Re: Multiple rollups/facets in one streaming aggregation?
For the initial implementation we could skip the merge piece if that helps get things done faster. In this scenario the metrics could be gathered after some parallel operation, and then there would be no need for a merge. Sample syntax:

metrics(parallel(join()))

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 16, 2016 at 1:25 PM, Joel Bernstein wrote:
> The concept of a MetricStream was in the early designs but hasn't yet been
> implemented. Now might be a good time to work on the implementation.
>
> The MetricStream wraps a stream and gathers metrics in memory, continuing
> to emit the tuples from the underlying stream. This allows multiple
> MetricStreams to operate over the same stream without transforming the
> stream. Pseudo-code for a metric expression syntax is below:
>
> metrics(metrics(search()))
>
> The MetricStream delivers its metrics through the EOF Tuple. So the
> MetricStream simply adds the finished aggregations to the EOF Tuple and
> returns it. If we're going to support parallel metric gathering then we'll
> also need to support merging of the metrics. Something like this:
>
> metrics(parallel(metrics(join())))
>
> Where the metrics wrapping the parallel function would need to collect the
> EOF tuples from each worker, merge the metrics, and then emit the merged
> metrics in an EOF Tuple.
>
> If you think this meets your needs, feel free to create a jira and begin
> a patch and I can help get it committed.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe <
> radu.gheor...@sematext.com> wrote:
>
>> Hello Solr users :)
>>
>> Right now it seems that if I want to rollup on two different fields
>> with streaming expressions, I would need to do two separate requests.
>> This is too slow for our use case, when we need to do joins before
>> sorting and rolling up (because we'd have to re-do the joins).
>>
>> Since in our case we are actually looking for some not-necessarily
>> accurate facets (top N), the best solution we could come up with was
>> to implement a new stream decorator that implements an algorithm like
>> Count-min sketch [1] which would run on the tuples provided by the
>> stream function it wraps. This would have two big wins for us:
>> 1) it would do the facet without needing to sort on the facet field,
>> so we'll potentially save lots of memory
>> 2) because sorting isn't needed, we could do multiple facets in one go
>>
>> That said, I have two (broad) questions:
>> A) is there a better way of doing this? Let's reduce the problem to
>> streaming aggregations, where the assumption is that we have multiple
>> collections where data needs to be joined, and then facet on fields
>> from all collections. But maybe there's a better algorithm, something
>> out of the box or closer to what is offered out of the box?
>> B) whatever the best way is, could we do it in a way that can be
>> contributed back to Solr? Any hints on how to do that? Just another
>> decorator?
>>
>> Thanks and best regards,
>> Radu
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
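One reason a Count-Min Sketch fits the parallel merge step described above: two sketches built with the same width, depth, and hash seeds combine by element-wise addition of their counter tables, so each worker's EOF tuple can carry its table and the wrapping metrics function just sums them. A small illustrative sketch (names are hypothetical, not Solr API):

```java
// Merge two Count-Min Sketch counter tables of identical dimensions,
// built with identical hash functions, by element-wise addition. The
// merged table answers estimates as if it had seen both streams.
public class CmsMerge {
    public static long[][] merge(long[][] a, long[][] b) {
        long[][] out = new long[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                out[i][j] = a[i][j] + b[i][j];
            }
        }
        return out;
    }
}
```

This is also why fixing the hash seeds matters: tables hashed with different seeds cannot be meaningfully added.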
Re: Multiple rollups/facets in one streaming aggregation?
The concept of a MetricStream was in the early designs but hasn't yet been implemented. Now might be a good time to work on the implementation.

The MetricStream wraps a stream and gathers metrics in memory, continuing to emit the tuples from the underlying stream. This allows multiple MetricStreams to operate over the same stream without transforming the stream. Pseudo-code for a metric expression syntax is below:

metrics(metrics(search()))

The MetricStream delivers its metrics through the EOF Tuple. So the MetricStream simply adds the finished aggregations to the EOF Tuple and returns it. If we're going to support parallel metric gathering then we'll also need to support merging of the metrics. Something like this:

metrics(parallel(metrics(join())))

Where the metrics wrapping the parallel function would need to collect the EOF tuples from each worker, merge the metrics, and then emit the merged metrics in an EOF Tuple.

If you think this meets your needs, feel free to create a jira and begin a patch and I can help get it committed.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Aug 16, 2016 at 11:52 AM, Radu Gheorghe wrote:
> Hello Solr users :)
>
> Right now it seems that if I want to rollup on two different fields
> with streaming expressions, I would need to do two separate requests.
> This is too slow for our use case, when we need to do joins before
> sorting and rolling up (because we'd have to re-do the joins).
>
> Since in our case we are actually looking for some not-necessarily
> accurate facets (top N), the best solution we could come up with was
> to implement a new stream decorator that implements an algorithm like
> Count-min sketch [1] which would run on the tuples provided by the
> stream function it wraps. This would have two big wins for us:
> 1) it would do the facet without needing to sort on the facet field,
> so we'll potentially save lots of memory
> 2) because sorting isn't needed, we could do multiple facets in one go
>
> That said, I have two (broad) questions:
> A) is there a better way of doing this? Let's reduce the problem to
> streaming aggregations, where the assumption is that we have multiple
> collections where data needs to be joined, and then facet on fields
> from all collections. But maybe there's a better algorithm, something
> out of the box or closer to what is offered out of the box?
> B) whatever the best way is, could we do it in a way that can be
> contributed back to Solr? Any hints on how to do that? Just another
> decorator?
>
> Thanks and best regards,
> Radu
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
> [1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
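The MetricStream behavior described above, pass every tuple through untouched, gather a metric on the side, and deliver the finished aggregate on the EOF tuple, can be sketched in plain Java. This uses simple Maps as stand-ins, not Solr's actual TupleStream/Tuple API; the names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the MetricStream idea: emit the underlying tuples unchanged,
// accumulate a metric (here a sum over one field) as they flow by, and
// attach the finished aggregation to the EOF tuple.
public class MetricGatherer {
    public static List<Map<String, Object>> run(
            List<Map<String, Object>> tuples,   // last tuple is the EOF marker
            String field,
            String metricName) {
        double sum = 0;
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> src : tuples) {
            Map<String, Object> t = new LinkedHashMap<>(src);
            if (Boolean.TRUE.equals(t.get("EOF"))) {
                t.put(metricName, sum); // deliver the metric on the EOF tuple
            } else {
                sum += ((Number) t.get(field)).doubleValue();
            }
            out.add(t); // the stream itself is not transformed
        }
        return out;
    }
}
```

Because the tuples pass through unmodified, several of these wrappers can be stacked over one stream, each gathering a different metric, which is the point of the metrics(metrics(search())) nesting.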
Multiple rollups/facets in one streaming aggregation?
Hello Solr users :)

Right now it seems that if I want to rollup on two different fields with streaming expressions, I would need to do two separate requests. This is too slow for our use case, when we need to do joins before sorting and rolling up (because we'd have to re-do the joins).

Since in our case we are actually looking for some not-necessarily accurate facets (top N), the best solution we could come up with was to implement a new stream decorator that implements an algorithm like Count-min sketch [1] which would run on the tuples provided by the stream function it wraps. This would have two big wins for us:
1) it would do the facet without needing to sort on the facet field, so we'll potentially save lots of memory
2) because sorting isn't needed, we could do multiple facets in one go

That said, I have two (broad) questions:
A) is there a better way of doing this? Let's reduce the problem to streaming aggregations, where the assumption is that we have multiple collections where data needs to be joined, and then facet on fields from all collections. But maybe there's a better algorithm, something out of the box or closer to what is offered out of the box?
B) whatever the best way is, could we do it in a way that can be contributed back to Solr? Any hints on how to do that? Just another decorator?

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

[1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch