Re: CommonTerms & slow queries

2019-03-29 Thread Michael Gibney
Can you post the query that's actually built for some of these inputs ("parsedquery" or "parsedquery_toString" output included for requests with "debug=query" parameter)? What is performance like if you turn off pf (i.e., no implicit phrase searching)? Michael On Fri, Mar 29, 2019 at 11:53 AM
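
As a rough sketch of the comparison suggested above, the two requests (with and without pf, both with debug=query) could be built as follows. The edismax parser, the field names "title" and "body", and the sample query text are assumptions for illustration only; none of them come from the thread.

```python
from urllib.parse import urlencode


def build_debug_params(user_query, qf, pf=None):
    """Build request parameters with debug=query enabled.

    Assumes an edismax-style parser; qf/pf values are hypothetical.
    """
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": qf,
        # debug=query adds "parsedquery" / "parsedquery_toString" to the response
        "debug": "query",
    }
    if pf:
        # Only send pf when implicit phrase searching is wanted
        params["pf"] = pf
    return params


# Same query twice: once with implicit phrase searching, once without
with_pf = build_debug_params("the quick brown fox", qf="title body", pf="title body")
without_pf = build_debug_params("the quick brown fox", qf="title body")

print(urlencode(with_pf))
print(urlencode(without_pf))
```

Comparing the parsed query and the timing of the two responses should show whether the phrase clauses are the expensive part.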

Re: Query of death? Collapsing Query Parser - Solr 7.5

2019-03-26 Thread Michael Gibney
Would you be willing to share your query-time analysis chain config, and perhaps the "debug=true" (or "debug=query") output for successful queries of a similar nature to the problematic ones? Also, re: "only times out on extreme queries" -- what do you consider to be an "extreme query", in this

Re: CommonTerms & slow queries

2019-03-29 Thread Michael Gibney
cuts time almost half but its still 5+sec > > Thank you for your help, more than happy to include more output.. > -Craig > > > On Fri, Mar 29, 2019 at 12:24 PM Michael Gibney > > wrote: > > > Can you post the query that's actually built for some of these inputs

Re: Performance problems with extremely common terms in collection (Solr 7.4)

2019-04-08 Thread Michael Gibney
In addition to Toke's suggestions (and those in the linked article), some more ideas: If single-term, bare queries are slow, it might be productive to check config/performance of your queryResultCache (I realize this doesn't directly address the concern of slow queries, but might nonetheless be

Re: Is anyone using proxy caching in front of solr?

2019-02-25 Thread Michael Gibney
Tangentially related, possibly of interest regarding solr-internal cache hit ratio (esp. with a lot of replicas): https://issues.apache.org/jira/browse/SOLR-13257 On Mon, Feb 25, 2019 at 11:33 AM Walter Underwood wrote: > Don’t worry about one and two character queries, because they will almost

Re: Query of Death Lucene/Solr 7.6

2019-02-22 Thread Michael Gibney
Ah... I think there are two issues likely at play here. One is LUCENE-8531, which reverts a bug related to SpanNearQuery semantics, causing possible query paths to be enumerated up front. Setting ps=0 (although perhaps not appropriate for some

Re: ExactStatsCache not working for distributed IDF

2019-03-14 Thread Michael Gibney
Are you basing your conclusion (that it's not working as expected) on the scores as reported in the debug output? If you haven't already, try adding "score" to the "fl" param -- if different (for a given doc) than the score as reported in debug, then it's probably working as intended ... just a

Re: Solr index slow response

2019-03-19 Thread Michael Gibney
I'll second Emir's suggestion to try disabling swap. "I doubt swap would affect it since there is such huge free memory." -- sounds reasonable, but has not been my experience, and the stats you sent indicate that swap is in fact being used. Also, note that in many cases setting vm.swappiness=0 is

Re: Query of Death Lucene/Solr 7.6

2019-02-08 Thread Michael Gibney
Hi Markus, As of 7.6, LUCENE-8531 reverted a graph/Spans-based phrase query implementation (introduced in 6.5 -- LUCENE-7699) to an implementation that builds a separate phrase query for each

Re: Antw: Re: Behaviour of punctuation marks in phrase queries

2019-05-17 Thread Michael Gibney
The SpanNearQuery in association with "a.b." input and WDGF is expected behavior, since WDGF causes the query to search ("ab")|("a" "b"), as 1 or 2 tokens, respectively. The "a. b." input (whitespace-separated) is tokenized simply as "a" "b" (2 tokens) so sticks with the more straightforward

Re: Antw: Re: Behaviour of punctuation marks in phrase queries

2019-05-17 Thread Michael Gibney
ld both be evaluated before ("a" "b"), leaving the impression of a gap between tokens, causing the match to be missed. On Fri, May 17, 2019 at 12:29 PM Michael Gibney wrote: > > The SpanNearQuery in association with "a.b." input and WDGF is > expected behav

Re: Query of Death Lucene/Solr 7.6

2019-05-30 Thread Michael Gibney
Very likely: https://issues.apache.org/jira/browse/SOLR-13336 Individual queries should still fail, but should fail fast, without the broader impact seen prior to 8.1. Does that describe the behavior you're seeing now with 8.1.1? Michael On Thu, May 30, 2019 at 11:55 AM Markus Jelsma wrote: > >

Re: HttpShardHandlerFactory

2019-08-19 Thread Michael Gibney
Mark, Another thing to check is that I believe the configuration you posted may not actually be taking effect. Unless I'm mistaken, I think the correct element name to configure the shardHandler is "shardHandler*Factory*", not "shardHandler" ... as in, '...' The element name is documented

Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Michael Gibney
I wonder if this might be similar/related to the underlying problem that is intended to be addressed by https://issues.apache.org/jira/browse/LUCENE-8985? btw, I think you only want to use FlattenGraphFilter *once* in the indexing analysis chain, towards the end (after all components that emit

Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Michael Gibney
Going back to the initial question, the wording is a little ambiguous and it occurs to me that it's possible there's a misunderstanding of what autoGeneratePhraseQueries does. It really only auto-generates phrase *subqueries*. To use the example from the initial request, a query like (black

Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-19 Thread Michael Gibney
sed to > still be a fail safe, did I miss something? > > Thanks again for your help, > > Nick > > On Wed, Dec 18, 2019, 8:10 AM Michael Gibney > wrote: > > > This is related to this issue: > > https://issues.apache.org/jira/browse/SOLR-13336 > >

Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-18 Thread Michael Gibney
This is related to this issue: https://issues.apache.org/jira/browse/SOLR-13336 Also tangentially relevant: https://issues.apache.org/jira/browse/LUCENE-8531 https://issues.apache.org/jira/browse/SOLR-12243 I think your options include: 1. setting slop=0, which restores SpanNearQuery as the

Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Michael Gibney
Possibly worth mentioning, although it might not be appropriate for your use case: if the fields you're interested in are configured with docValues, you could use streaming expressions (or directly handle thread-per-shard connections to the /export handler) and get everything in a single shot

Re: Phrase search and WordDelimiterGraphFilter not working as expected with mixed delimited and non-delimited tokens

2020-02-19 Thread Michael Gibney
There are many layers to this, but for the config you posted (applying index-time WDGF configured to both split and concatenate tokens), the fundamental issue is that Lucene doesn't index positionLength, so the graph structure (and token adjacency information) of the token stream is lost when it's

Re: Solr 7.7 heap space is getting full

2020-01-22 Thread Michael Gibney
Rajdeep, you say that "suddenly" heap space is getting full ... does this mean that some variant of this configuration was working for you at some point, or just that the failure happens quickly? If heap space and faceting are indeed the bottleneck, you might make sure that you have docValues

Re: Nested Document with replicas slow

2020-04-13 Thread Michael Gibney
Depending on how you're measuring performance (and whether your use case benefits from caching), it might be worth looking into stable replica routing (configured with the "replica.base" sub-parameter of the shards.preference

Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
distributed requests. On Mon, May 11, 2020 at 1:49 PM Michael Gibney wrote: > > Wei, probably no need to answer my earlier questions; I think I see > the problem here, and believe it is indeed a bug, introduced in 8.3. > Will file an issue and submit a patch shortly. > Michael > >

Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Hi Wei, In considering this problem, I'm stumbling a bit on terminology (particularly, where you mention "nodes", I think you're referring to "replicas"?). Could you confirm that you have 10 TLOG replicas per shard, for each of 6 shards? How many *nodes* (i.e., running solr server instances) do

Re: Unbalanced shard requests

2020-05-15 Thread Michael Gibney
al,replica.type:TLOG, > I also tried just shards.preference=replica.location:local and it still has > the issue. Can you explain a bit more? > > On Mon, May 11, 2020 at 12:26 PM Michael Gibney > wrote: > > > FYI: https://issues.apache.org/jira/browse/SOLR-14471 >

Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Wei, probably no need to answer my earlier questions; I think I see the problem here, and believe it is indeed a bug, introduced in 8.3. Will file an issue and submit a patch shortly. Michael On Mon, May 11, 2020 at 12:49 PM Michael Gibney wrote: > > Hi Wei, > > In considering this

Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread Michael Gibney
As you've observed, it is indeed possible to facet on fields with docValues=true, indexed=false; but in almost all cases you should probably set indexed=true. 1. for distributed facet count refinement, the "indexed" approach is used to look up counts by value; 2. assuming you're wanting to do

Re: Solr 8.3.1 longer query latency over 6.4.2

2020-08-21 Thread Michael Gibney
do see much higher thread count (60 for Solr 6 vs 150 > for Solr 8 on average) even on a relatively quiet system. That seems an > interesting statistic, but not really sure what it signifies. We mostly > take the OOTB defaults for most everything, and config changes were > minim

Re: Solr 8.3.1 longer query latency over 6.4.2

2020-08-19 Thread Michael Gibney
Hi Elaine, I'm curious what happens if you remove "pf" (phrase field) setting from your edismax config? This question brought to mind https://issues.apache.org/jira/browse/SOLR-12243?focusedCommentId=16836448#comment-16836448 and https://issues.apache.org/jira/browse/LUCENE-8531. This *could*

Re: Simulate facet.exists for json query facets

2020-10-28 Thread Michael Gibney
Separately, and in parallel to Erick's question: indeed I'm not aware of any way to do this currently, but I *can* imagine cases where this would be useful. I have a sense this could be cleanly implemented as a stat facet function

Re: json.facet floods the filterCache

2020-10-26 Thread Michael Gibney
uide/8_6/json-facet-api.html#nested-facets) > or sub-facets, and am using the 'terms' facet. > > Digging around more looks like I can set 'cacheDf=-1' to disable the use of > the cache. > > On Fri, 23 Oct 2020 at 00:14, Michael Gibney > wrote: > > > Damien, >

Re: Facet Performance

2020-06-17 Thread Michael Gibney
facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works. I was going to suggest that you might not want to set indexed=false on the docValues

Re: Facet Performance

2020-06-17 Thread Michael Gibney
enough > > memory to the filterCache. > > > > I haven't yet tried changing the uninvertible setting, I was looking at the > > documentation for this field earlier today. > > Should we be setting uninvertible="false" if docValues="true" regardless of

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Michael Gibney
I agree with Shawn that the top contenders so far (from my perspective) are "primary/secondary" and "publisher/subscriber", and agree with Walter that whatever term pair is used should ideally be usable *as a pair* (to identify a cluster type) in addition to individually (to identify the

Re: [EXTERNAL] - SolR OOM error due to query injection

2020-06-11 Thread Michael Gibney
Guilherme, The answer is likely to be dependent on the query parser, query parser configuration, and analysis chains. If you post those it could aid in helping troubleshoot. One thing that jumps to mind is the asterisks ("*") -- if they're interpreted as wildcards, that could be problematic? More

Re: Solr using all available CPU and becoming unresponsive

2021-01-11 Thread Michael Gibney
Hi Jeremy, Can you share your analysis chain configs? (SOLR-13336 can manifest in a similar way, and would affect 7.3.1 with a susceptible config, given the right (wrong?) input ...) Michael On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith wrote: > Hello all, > We have been struggling with an

Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Michael Gibney
t copying existing structures from the defaults. In this case, > stopwords.txt is completely empty and synonyms.txt is just the default > synonyms.txt, which seems not useful at all for us. Could I just take out > the StopFilterFactory and SynonymGraphFilterFactory from the query section > (an

Re: nested facets of query and terms type in JSON format

2020-12-03 Thread Michael Gibney
Arturas, I think your syntax is wrong for the range subfacet? -- the configuration of the range facet should be directly under the `tt` key, rather than nested under `t_buckets` in the request. (The response introduces a "buckets" attribute that is not part of the request syntax). Michael On Thu,

Re: nested facets of query and terms type in JSON format

2020-12-03 Thread Michael Gibney
0:00.000Z", > "end": "2020-11-16T21:00:00.000Z", > "gap": "+1HOUR" > "limit": 1 > } > } > } > } > } > >

Re: Multiple Facets on Same Field

2020-11-17 Thread Michael Gibney
Answering a slightly different question perhaps, but you can definitely do this with the "JSON Facet" API, where there's much cleaner separation between different facets (and output is assigned to arbitrary keys). Michael On Tue, Nov 17, 2020 at 9:36 AM Jason Gerlowski wrote: > > Hi all, > > Is

Re: Multiple Facets on Same Field

2020-11-17 Thread Michael Gibney
n explicit "{!terms}" query as a > filter. I see you suggested that as a workaround here [1]. > > Jason > > [1] > http://mail-archives.apache.org/mod_mbox/lucene-dev/202010.mbox/%3CCAF%3DheHGKwGtvq%3DgAndmVrgvo1cxKmzP0neGi17_eoVhubpaBZA%40mail.gmail.com%3E > > On Tue, Nov 17,

Re: Avoiding duplicate entry for a multivalued field

2020-10-29 Thread Michael Gibney
If I understand correctly what you're trying to do, docValues for a number of field types are (at least in their multivalued incarnation) backed by SortedSetDocValues, which inherently deduplicate values per-document. In your case it sounds like you could maybe rely on that behavior as a feature,

Re: Simulate facet.exists for json query facets

2020-10-30 Thread Michael Gibney
an > demonstrate need. > > If you issue a debug=timing you’ll see the time each component > takes, and there’s a separate entry for faceting so that’ll give you > a clue whether it’s worth the effort. > > Best, > Erick > > > On Oct 30, 2020, at 8:10 AM, Michael Gibney

Re: Simulate facet.exists for json query facets

2020-10-30 Thread Michael Gibney
Michael, sorry for the confusion; I was positing a *hypothetical* "exists()" function that doesn't currently exist, that *is* an aggregate function, and that *does* stop early. I didn't account for the fact that there's already an "exists()" function *query* that behaves very differently. So yes,

Re: json.facet floods the filterCache

2020-10-22 Thread Michael Gibney
Damien, Are you able to share the actual json.facet request that you're using (at least just the json.facet part)? I'm having a hard time being confident that I'm correctly interpreting when you say "a json.facet query on nested facets terms". Michael On Thu, Oct 22, 2020 at 3:52 AM Christine

Re: Handling acronyms

2021-01-15 Thread Michael Gibney
The equivalent terms on the right-hand side of the `=>` operator in the example you sent should be separated by a comma. You mention you already tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`) and that that yielded unexpected results as well. I would recommend

Re: Handling acronyms

2021-01-15 Thread Michael Gibney
minded me of the > expand option which I meant to have a look at. > > Thanks > Shaun > > On Fri, 15 Jan 2021 at 14:33, Michael Gibney > wrote: > > > The equivalent terms on the right-hand side of the `=>` operator in the > > example you sent should be separ

Re: Solrcloud - Reads on specific nodes

2021-01-15 Thread Michael Gibney
I know you're asking about nodes, not replicas; but depending on what you're trying to achieve you might be as well off routing requests based on replica. Have you considered the various options available via the `shards.preference` param [1]? For instance, you could set up your "write" replicas
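
For illustration only, a `shards.preference` value is a comma-separated list of `property:value` rules, applied in order. The helper below is a hypothetical convenience, not part of any Solr client, and the choice of PULL-then-TLOG is just an assumed example of steering reads away from other replica types.

```python
def shards_preference(*rules):
    """Join (property, value) pairs into a shards.preference string.

    Earlier rules take precedence over later ones.
    """
    return ",".join(f"{prop}:{value}" for prop, value in rules)


# Assumed example: prefer PULL replicas for reads, then fall back to TLOG
pref = shards_preference(("replica.type", "PULL"), ("replica.type", "TLOG"))
params = {"q": "*:*", "shards.preference": pref}

print(params["shards.preference"])
```

Whether this fits depends on how replica types are laid out in the collection, which the thread excerpt doesn't show.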

Re: Handling acronyms

2021-01-15 Thread Michael Gibney
EDIT: "the equivalent terms are separated by commas (as they should be)" => "the equivalent terms are _not_ separated by commas (as they should be)" On Fri, Jan 15, 2021 at 10:09 AM Michael Gibney wrote: > Shaun, > > I'm not 100% sure, but don't give up on this

Re: DocValued SortableText Field is slower than Non DocValued String Field for Facet

2021-01-28 Thread Michael Gibney
I'm not sure about _performance_, but I'm pretty sure you don't want to be faceting on docValued SortableTextField (and faceting on non-docValued SortableTextField, though I think technically possible, works against uninverted _indexed_values, so ends up doing something entirely different):

Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
> Performance and resource is still affected by 30M unique values of T right? Yes. The main performance issue would be the per-request allocation of a 30M-element `long[]` for "dv" or "uif" methods (which are by far the most common methods in practice). With low enough request volume and large

Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
On Fri, Feb 5, 2021 at 12:49 PM Michael Gibney wrote: > > Performance and resource is still affected by 30M unique values of T > right? > Yes. The main performance issue would be the per-request allocation of a > 30M-element `long[]` for "dv" or "uif" methods

Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
> One thing I can add is I tried dvhash with a string multi-valued field, it > worked and didn’t throw any error but I don’t know if it got silently > ignored or just worked. > > Sent from Mail for Windows 10 > > From: Michael Gibney > Sent: 05 February 2021 20:52 > To

Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
`resultId` sounds like it might be a relatively high-cardinality field (lots of unique values)? What's your number of shards, and replicas per shard? SOLR-15008 (note: not a bug) describes a situation that may be fundamentally similar to yours (though to be sure it's impossible to say for sure

Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
Ah! that's significant. The latency is likely due to building the OrdinalMap (which maps segment ords to global ords) ... "dvhash" (assuming the relevant fields are not multivalued) will very likely work; "dvhash" doesn't map to global ords, so doesn't need to build the OrdinalMap (which gets
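
A minimal sketch of requesting "dvhash" on a JSON terms facet, assuming a single-valued docValues field. The facet key "by_result", the field name "resultId" and the `*:*` query are placeholders for illustration; the actual request from the thread isn't shown here.

```python
import json


def terms_facet(field, method=None, limit=10):
    """Build a JSON Facet API terms facet, optionally forcing a method."""
    facet = {"type": "terms", "field": field, "limit": limit}
    if method is not None:
        # "dvhash" collects per-segment values into a hash table,
        # so it doesn't need the global-ordinal mapping
        facet["method"] = method
    return facet


# Placeholder facet key and field name
json_facet = {"by_result": terms_facet("resultId", method="dvhash")}
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(json_facet)}

print(params["json.facet"])
```

This is most likely to help when the result set is small relative to the number of unique values in the field.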

Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
Apologies, I missed deducing from the request url that you're already talking strictly about single-shard requests (so everything I was suggesting about shards.preference etc. is not applicable). "dvhash" is still worth a try though, esp. with `numFound` being 943 (out of 185 million!). Does this