Re: Multiple rollups/facets in one streaming aggregation?

2017-07-30 Thread Peter Shmukler
I need to improve the user experience of facet calculation.
Let's assume we have time-partitioned collections:
Partition1, Partition2, Partition3, ...
AliasAllPartitions unifies all partitions.
Running facets on AliasAllPartitions is a very heavy synchronous operation;
the user has to wait a long time for the first result.

My suggestion is to process one partition after another and return partial
results at certain points.
This could be relevant for any aggregate, faceting, or count-distinct function.
I actually only need an estimate of the facets, so I can use a Count-Min Sketch
and HLL to keep memory consumption reasonable.
The interface could look like this:
CMSFacet(
  list(
    search(partition1,q=*:*,fl="author,name,price",qt="/export",sort="name asc"),
    search(partition2,q=*:*,fl="author,name,price",qt="/export",sort="name asc"),
    search(partition3,q=*:*,fl="author,name,price",qt="/export",sort="name asc"),
    search(partition4,q=*:*,fl="author,name,price",qt="/export",sort="name asc"),
    search(partition5,q=*:*,fl="author,name,price",qt="/export",sort="name asc")
  ),
  bucketSizeLimit=150, sizeLimit=400, sum(price), min(price), CMScount(name)
)

Expected output:
{
  "result-set": {
    "docs": [
      {
        "min(price)": "215464",
        "sum(price)": "23545846",
        "CMScount(name)": {"rows":149,"facet":[
          {"A Clash of Kings28":4},{"A Clash of Kings16":4},{"A Clash of Kings27":4},
          {"A Clash of Kings15":4},{"A Clash of Kings26":4},{"A Clash of Kings14":4},
          {"A Clash of Kings25":4},{"A Clash of Kings19":4},{"A Clash of Kings18":4},
          {"A Clash of Kings29":4},{"A Game of Thrones18":6},{"A Clash of Kings20":4},
          {"A Clash of Kings13":4},{"A Clash of Kings24":4},{"A Clash of Kings12":4},
          {"A Clash of Kings23":4},{"A Clash of Kings22":4},{"A Clash of Kings10":4},
          {"A Clash of Kings21":4},{"A Clash of Kings5":4}]}
      },
      {
        "min(price)": "655464",
        "sum(price)": "3584684646846",
        "CMScount(name)": {"rows":299,"facet":[
          {"A Storm of Swords18":8},{"A Game of Thrones18":8},{"A Game of Thrones28":7},
          {"A Game of Thrones27":7},{"A Game of Thrones24":5},{"A Game of Thrones3":11},
          {"A Game of Thrones4":10},{"A Game of Thrones6":8},{"A Storm of Swords20":7},
          {"A Game of Thrones8":6},{"A Game of Thrones9":7},{"A Storm of Swords11":8},
          {"A Storm of Swords22":8},{"A Storm of Swords21":10},{"A Storm of Swords13":8},
          {"A Storm of Swords24":8},{"A Storm of Swords23":13},{"A Storm of Swords15":7},
          {"A Storm of Swords26":8},{"A Storm of Swords27":7}]}
      },
      {
        "min(price)": -214.87158,
        "sum(price)": -40523.873622472,
        "CMScount(name)": {"rows":399,"facet":[
          {"A Storm of Swords18":12},{"A Game of Thrones18":12},{"A Game of Thrones28":11},
          {"A Game of Thrones27":11},{"A Game of Thrones24":15},{"A Game of Thrones3":11},
          {"A Game of Thrones4":10},{"A Game of Thrones6":12},{"A Storm of Swords20":7},
          {"A Game of Thrones8":6},{"A Game of Thrones9":7},{"A Storm of Swords11":12},
          {"A Storm of Swords22":8},{"A Storm of Swords21":10},{"A Storm of Swords13":12},
          {"A Storm of Swords24":12},{"A Storm of Swords23":13},{"A Storm of Swords15":7},
          {"A Storm of Swords26":12},{"A Storm of Swords27":11}]}
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 4381
      }
    ]
  }
}
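
To make the "partial results" idea concrete, here is a rough sketch in plain Python (not Solr code). It assumes a snapshot is emitted after each partition rather than when bucketSizeLimit is reached, and it uses an exact Counter where the proposal would use a Count-Min Sketch. The function name partial_facets and the sample documents are made up for illustration.

```python
from collections import Counter


def partial_facets(partitions, field="name", metric_field="price", top_n=3):
    """Yield a cumulative snapshot after each partition has been consumed."""
    counts = Counter()   # exact counts; a Count-Min Sketch could replace this
    total = 0.0
    minimum = None
    rows = 0
    for docs in partitions:
        for doc in docs:
            rows += 1
            counts[doc[field]] += 1
            price = doc[metric_field]
            total += price
            minimum = price if minimum is None else min(minimum, price)
        # Partial result: everything seen so far, not just this partition.
        yield {
            "rows": rows,
            "sum": total,
            "min": minimum,
            "facet": counts.most_common(top_n),
        }
    # Final marker, mirroring the EOF tuple of a tuple stream.
    yield {"EOF": True}


# Hypothetical sample data: two small "partitions".
partitions = [
    [{"name": "A Clash of Kings", "price": 10.0},
     {"name": "A Game of Thrones", "price": 7.5}],
    [{"name": "A Game of Thrones", "price": 8.0},
     {"name": "A Storm of Swords", "price": 12.0}],
]
snapshots = list(partial_facets(partitions))
for snapshot in snapshots:
    print(snapshot)
```

Because each snapshot is cumulative, a client can render the first one as soon as the first partition finishes and refine the display as later snapshots arrive.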
I wrote a prototype of this functionality based on Solr 7.
I implemented class CMSFacetStream extends TupleStream implements
Expressible and class CMSMetric extends Metric.

My current issues:
-   I return result tuples as soon as bucketSizeLimit is reached, but I
don't see the partial results in the response.
-   How can I return a JSON object from a Metric class?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4348260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple rollups/facets in one streaming aggregation?

2016-09-01 Thread Alessandro Benedetti
It is not possible to sort on multivalued fields out of the box (indeed,
if you don't specify any logic, which value should be considered the
reference for sorting?).
Are you trying to sort, or does this happen by default?

Cheers




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Multiple rollups/facets in one streaming aggregation?

2016-09-01 Thread subramani.new
Hello,

I am exploring Solr and its new features (the Parallel SQL Interface and
the Streaming API).

I have tried most of the APIs and they work fine, but I am facing an issue
with multivalued fields.

My JSON input has multivalued fields. I am trying to aggregate on those
fields but am unable to.

Exception :
can not sort on multivalued field

My use case :  
 
input:
{
id: 1
field1:[1,2,3],
app.name:[watsapp,facebook,... ]
}
{
id: 2
field1:[1,2,3],
app.name:[watsapp,facebook,... ]
}
 
Expected result :
watsapp: 2
facebook : 2
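
For clarity, the expected result above can be expressed in plain Python (this only illustrates the desired semantics and is not Solr code). It assumes each value is counted once per document, and it cuts the elided app.name lists down to the two values shown; count_multivalued is a made-up helper name.

```python
from collections import Counter

# Sample documents from the use case; the "..." entries are left out here.
docs = [
    {"id": 1, "field1": [1, 2, 3], "app.name": ["watsapp", "facebook"]},
    {"id": 2, "field1": [1, 2, 3], "app.name": ["watsapp", "facebook"]},
]


def count_multivalued(docs, field):
    """Count how many documents contain each value of a multivalued field."""
    counts = Counter()
    for doc in docs:
        # set() so a value repeated inside one document is counted once.
        counts.update(set(doc.get(field, [])))
    return counts


result = count_multivalued(docs, "app.name")
print(result["watsapp"], result["facebook"])
```

No sort on the field is involved here: the counts are accumulated in a hash map, which is why a single document can contribute to several buckets.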

I have 2 TB of data and want to execute with aggmode=map_reduce. Any
suggestions?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-rollups-facets-in-one-streaming-aggregation-tp4291952p4294270.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multiple rollups/facets in one streaming aggregation?

2016-08-17 Thread Radu Gheorghe
Thanks a lot, Joel, for your very fast and informative reply!

We'll chew on this and open a Jira if we go down this route.
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Joel Bernstein
For the initial implementation we could skip the merge piece if that helps
get things done faster. In this scenario the metrics could be gathered
after some parallel operation, then there would be no need for a merge.
Sample syntax:

metrics(parallel(join()))


Joel Bernstein
http://joelsolr.blogspot.com/



Re: Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Joel Bernstein
The concept of a MetricStream was in the early designs but hasn't yet been
implemented. Now might be a good time to work on the implementation.

The MetricStream wraps a stream and gathers metrics in memory, continuing
to emit the tuples from the underlying stream. This allows multiple
MetricStreams to operate over the same stream without transforming the
stream. Pseudocode for a metric expression syntax is below:

metrics(metrics(search()))
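
The wrapping behaviour can be sketched in plain Python generators (an illustration only, not the Solr API). The names search and metrics are stand-ins, and the sketch assumes the end of the stream is marked by a tuple with an "EOF" key to which each wrapper attaches its result.

```python
EOF = "EOF"


def search(docs):
    """Stand-in for a stream source: yields tuples, then an EOF tuple."""
    for doc in docs:
        yield dict(doc)
    yield {EOF: True}


def metrics(stream, name, update, initial):
    """Pass tuples through unchanged while accumulating one metric.

    The finished metric is attached to the EOF tuple under `name`.
    """
    state = initial
    for tup in stream:
        if tup.get(EOF):
            tup[name] = state
            yield tup
            return
        state = update(state, tup)
        yield tup


docs = [
    {"author": "a", "price": 3},
    {"author": "b", "price": 5},
    {"author": "a", "price": 4},
]

# Two metric wrappers over the same underlying stream, read in a single pass.
stream = metrics(
    metrics(search(docs), "sum(price)", lambda s, t: s + t["price"], 0),
    "count(*)", lambda s, t: s + 1, 0,
)
out = list(stream)
print(out[-1])
```

Since neither wrapper changes or reorders the tuples, any number of them can be stacked, and the consumer still sees the original stream followed by a single EOF tuple carrying all the metrics.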

The MetricStream delivers its metrics through the EOF Tuple. So the
MetricStream simply adds the finished aggregations to the EOF Tuple and
returns it. If we're going to support parallel metric gathering then we'll
also need to support the merging of the metrics. Something like this:

metrics(parallel(metrics(join())))

Where the metrics wrapping the parallel function would need to collect the
EOF tuples from each worker, then merge the metrics and emit the
merged metrics in an EOF Tuple.
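
A rough Python illustration of that merge step (not Solr code). The metric names in the tuples and the helper merge_eof_tuples are assumptions made for the example.

```python
from collections import Counter


def merge_eof_tuples(eof_tuples):
    """Combine per-worker EOF tuples into a single merged EOF tuple."""
    merged = {
        "EOF": True,
        "sum(price)": 0,
        "count(*)": 0,
        "facet(author)": Counter(),
    }
    for eof in eof_tuples:
        # Sums and counts are additive across workers.
        merged["sum(price)"] += eof["sum(price)"]
        merged["count(*)"] += eof["count(*)"]
        # Facet buckets are merged by adding the counts per key.
        merged["facet(author)"].update(eof["facet(author)"])
    return merged


# Hypothetical EOF tuples as two workers might emit them.
worker_eofs = [
    {"EOF": True, "sum(price)": 12, "count(*)": 3,
     "facet(author)": {"a": 2, "b": 1}},
    {"EOF": True, "sum(price)": 8, "count(*)": 2,
     "facet(author)": {"b": 2}},
]
merged = merge_eof_tuples(worker_eofs)
print(merged)
```

Only metrics that can be combined from partial results merge this simply. An average, for example, would have to be carried as a sum and a count and divided after the merge, rather than averaged per worker.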

If you think this meets your needs, feel free to create a Jira and
begin a patch, and I can help get it committed.


Joel Bernstein
http://joelsolr.blogspot.com/



Multiple rollups/facets in one streaming aggregation?

2016-08-16 Thread Radu Gheorghe
Hello Solr users :)

Right now it seems that if I want to rollup on two different fields
with streaming expressions, I would need to do two separate requests.
This is too slow for our use-case, when we need to do joins before
sorting and rolling up (because we'd have to re-do the joins).

Since in our case we are actually looking for some not-necessarily
accurate facets (top N), the best solution we could come up with was
to implement a new stream decorator that implements an algorithm like
Count-min sketch[1] which would run on the tuples provided by the
stream function it wraps. This would have two big wins for us:
1) it would do the facet without needing to sort on the facet field,
so we'll potentially save lots of memory
2) because sorting isn't needed, we could do multiple facets in one go
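
To illustrate the two points above, here is a minimal Count-min sketch in plain Python with a small top-N candidate list per field (an illustration under assumptions, not a Solr decorator). The width and depth defaults, the salted BLAKE2 hashing, and the names CountMinSketch and top_n_facets are choices made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One salted hash per row stands in for `depth` independent hashes.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._indexes(item))


def top_n_facets(tuples, fields, n=2):
    """Facet on several fields in one pass over an unsorted stream."""
    sketches = {f: CountMinSketch() for f in fields}
    candidates = {f: {} for f in fields}
    for tup in tuples:
        for f in fields:
            value = tup[f]
            sketches[f].add(value)
            candidates[f][value] = sketches[f].estimate(value)
            if len(candidates[f]) > n:
                # Evict the candidate with the smallest estimated count.
                weakest = min(candidates[f], key=candidates[f].get)
                del candidates[f][weakest]
    return {
        f: sorted(c.items(), key=lambda kv: -kv[1])
        for f, c in candidates.items()
    }


tuples = [
    {"author": "a", "genre": "x"}, {"author": "a", "genre": "y"},
    {"author": "a", "genre": "x"}, {"author": "b", "genre": "x"},
    {"author": "b", "genre": "y"}, {"author": "c", "genre": "x"},
]
result = top_n_facets(tuples, ["author", "genre"])
print(result)
```

Memory use is fixed by width, depth and n per field, regardless of how many distinct values pass through, and no sort on the facet field is required. The trade-off is that hash collisions can inflate a count, so results are approximate.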

That said, I have two (broad) questions:
A) is there a better way of doing this? Let's reduce the problem to
streaming aggregations, where the assumption is that we have multiple
collections where data needs to be joined, and then facet on fields
from all collections. But maybe there's a better algorithm, something
out of the box or closer to what is offered out of the box?
B) whatever the best way is, could we do it in a way that can be
contributed back to Solr? Any hints on how to do that? Just another
decorator?

Thanks and best regards,
Radu
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

[1] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch