[jira] [Commented] (SOLR-9125) CollapseQParserPlugin allocations are index based, not query based

2016-05-17 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287401#comment-15287401
 ] 

Joel Bernstein commented on SOLR-9125:
--

What I was thinking was to first run the query and get the cardinality. But 
that means the CollapsingQParserPlugin would have to know the main query and 
all the filter queries, which doesn't sound like it would be fun to write or 
maintain.

> CollapseQParserPlugin allocations are index based, not query based
> --
>
> Key: SOLR-9125
> URL: https://issues.apache.org/jira/browse/SOLR-9125
> Project: Solr
> Issue Type: Improvement
> Components: query parsers
> Reporter: Jeff Wartes
> Priority: Minor
> Labels: collapsingQParserPlugin
>
> Among other things, CollapsingQParserPlugin’s OrdScoreCollector allocates 
> space per-query for:
> - 1 int (doc id) per ordinal
> - 1 float (score) per ordinal
> - 1 bit (FixedBitSet) per document in the index
>
> So the higher the cardinality of the thing you’re grouping on, and the more 
> documents in the index, the more memory gets consumed per query. Since high 
> cardinality and large indexes are the use-cases CollapseQParserPlugin was 
> designed for, I thought I'd point this out.
> My real issue is that this does not vary based on the number of results in 
> the query, either before or after collapsing, so a query that results in one 
> doc consumes the same amount of memory as one that returns all of them. All 
> of the Collectors suffer from this to some degree, but I think OrdScore is 
> the worst offender.
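
For a rough sense of scale, the allocation described above can be estimated 
with a few lines of Java. This is only a back-of-the-envelope sketch: the class 
and method names are made up for illustration, the 700M document count in 
main() is a hypothetical figure (the issue does not give an index size), the 
1/7 ratio is the one mentioned elsewhere in this thread, and JVM object and 
array header overhead is ignored.

```java
final class CollapseMemoryEstimate {

    /**
     * Approximate bytes allocated per query for the structures listed in the
     * issue: one int and one float per ordinal, plus one bit per document.
     */
    static long bytesPerQuery(long ordinalCount, long maxDoc) {
        long docIdBytes = ordinalCount * Integer.BYTES;        // 1 int (doc id) per ordinal
        long scoreBytes = ordinalCount * Float.BYTES;          // 1 float (score) per ordinal
        long bitSetBytes = ((maxDoc + 63) / 64) * Long.BYTES;  // 1 bit per doc, held in 64-bit words
        return docIdBytes + scoreBytes + bitSetBytes;
    }

    public static void main(String[] args) {
        long maxDoc = 700_000_000L;   // hypothetical index size
        long ordinals = maxDoc / 7;   // collapse field cardinality at 1/7th of the doc count
        System.out.println(bytesPerQuery(ordinals, maxDoc) + " bytes per query");
    }
}
```

Note that the number of documents matching the query does not appear anywhere 
in the calculation, which is the point the issue is making.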



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9125) CollapseQParserPlugin allocations are index based, not query based

2016-05-17 Thread Jeff Wartes (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287339#comment-15287339
 ] 

Jeff Wartes commented on SOLR-9125:
---

Isn't there a chicken-and-egg situation there? You need the set of matching 
docs to figure out the HLL.cardinality to specify the initial size of the map 
you're going to save the set of matching docs in? 

Or maybe collect() would just throw every doc in the FixedBitSet (FBS), and 
finish() would do all the work of finding group heads and collapsing?




[jira] [Commented] (SOLR-9125) CollapseQParserPlugin allocations are index based, not query based

2016-05-17 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287265#comment-15287265
 ] 

Joel Bernstein commented on SOLR-9125:
--

One approach that might work for switching to primitive maps would be to first 
estimate the cardinality of the collapse values in the result set using 
HyperLogLog, and then size the primitive map accordingly. But my guess is 
that this approach is going to really hurt performance quite a bit. 
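
To make the idea concrete, here is a minimal sketch of estimating the number of 
distinct collapse ordinals with a small hand-rolled HyperLogLog and turning the 
estimate into an initial table size. Everything in it is an assumption for 
illustration: the class name, the 2^14 register count, the splitmix64-style 
hash and the load factor are not taken from Solr, and it says nothing about 
the cost of the extra pass over the matching docs.

```java
final class CollapseCardinalitySketch {
    private static final int P = 14;        // index bits, so 2^14 registers (assumed)
    private static final int M = 1 << P;
    private final byte[] registers = new byte[M];

    // splitmix64-style mixer, used here as a stand-in 64-bit hash
    private static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Record one collapse ordinal seen in the result set. */
    void add(int collapseOrd) {
        long h = mix(collapseOrd);
        int idx = (int) (h >>> (64 - P));   // top P bits select a register
        long rest = h << P;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    /** Estimated number of distinct ordinals added so far. */
    long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r);     // 2^-r
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * (double) M / sum;
        if (raw <= 2.5 * M && zeros > 0) {
            raw = M * Math.log((double) M / zeros);   // small-range (linear counting) correction
        }
        return Math.round(raw);
    }

    /** Power-of-two table size that holds the estimate without resizing. */
    static int initialCapacity(long estimatedDistinct, double loadFactor) {
        long needed = (long) Math.ceil(estimatedDistinct / loadFactor);
        long capped = Math.min(Math.max(needed, 4L), 1L << 30);
        int pow2 = Integer.highestOneBit((int) capped);
        return pow2 == capped ? pow2 : pow2 << 1;
    }
}
```

The result of estimate() would then be used as the size hint when constructing 
whichever primitive map replaces the arrays. The sketch itself costs a fixed 
16 KB of registers regardless of index size.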






[jira] [Commented] (SOLR-9125) CollapseQParserPlugin allocations are index based, not query based

2016-05-17 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15287208#comment-15287208
 ] 

Joel Bernstein commented on SOLR-9125:
--

Yeah, the CollapsingQParserPlugin can use a lot of memory. The original design 
goal was to increase performance for collapsing on high cardinality fields and 
large result sets, as opposed to large indexes. It was really designed to 
support fast collapse queries on large e-commerce catalogs, which are still 
typically small compared to other data sets.

If we can find a way to maintain the performance and shrink the memory usage, 
that would be a great thing. 






[jira] [Commented] (SOLR-9125) CollapseQParserPlugin allocations are index based, not query based

2016-05-17 Thread Jeff Wartes (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286940#comment-15286940
 ] 

Jeff Wartes commented on SOLR-9125:
---

I messed around a little bit, but I don't have a solution for this. I thought 
I'd file the issue anyway just to shine some light.

I had attempted to use CollapseQParserPlugin on a very large index using a 
collapse on a field whose cardinality was about 1/7th the doc count... it 
didn't go well. Worse, the issue didn't come up until pretty late in the game, 
because at low query rate and/or on smaller indexes, the problem isn't evident. 
I abandoned the attempt.

Some stuff I tried:

- I thought about replacing the FBS with a DocIdSetBuilder, but 
DelegatingCollector.finish() gets called twice, and you can't 
DocIdSetBuilder.build() twice on the same builder. We'd need to save the first 
build() result and use it to initialize a new builder for the second, but I 
wasn't convinced I understood the distinction between the two passes.
- I did one quick test where I replaced the "ords" and "scores" arrays with an 
IntIntScatterMap and an IntFloatScatterMap, thinking those would work better 
for small result sets. That ended up being worse (from a total allocations 
standpoint) for the queries I was trying, probably because of the map resizing 
required. It might be possible to set initial size values from statistics and 
help this case that way. It would also be possible to encode the docId/score 
into a long and just use one IntLongScatterMap, but I didn't try that.
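
The single-map encoding mentioned in the last bullet could look roughly like 
this. It is a sketch, not code from Solr: the class and method names are 
hypothetical, and it only shows the bit packing, leaving out the map itself.

```java
final class DocScorePacking {

    /** Pack a doc id (high 32 bits) and a score (low 32 bits) into one long. */
    static long pack(int docId, float score) {
        // Mask the score bits so a negative float does not sign-extend into the doc id.
        return ((long) docId << 32) | (Float.floatToRawIntBits(score) & 0xFFFFFFFFL);
    }

    /** Recover the doc id from the high 32 bits. */
    static int docId(long packed) {
        return (int) (packed >>> 32);
    }

    /** Recover the score from the low 32 bits. */
    static float score(long packed) {
        return Float.intBitsToFloat((int) packed);
    }
}
```

One map lookup per collected doc would then return both values, at the cost of 
unpacking the score before comparing it against the incoming one.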
