Field collapsing, facets, and qtime: caching issue?
I'm experimenting with field collapsing in SolrCloud 6.2.1 and have this set of request parameters against a collection:

/default?indent=on&q=*:*&wt=json&fq={!collapse field=groupid}

My default handler is just defaults: explicit.

The first query runs about 600ms; subsequent repeats of the same query show qTimes of 0-5ms, which I interpret to mean that the query is cached after the first hit. All as expected.

However, if I enable facets without actually requesting a facet:

/default?indent=on&q=*:*&wt=json&fq={!collapse field=groupid}&facet=true

then every submission of the query runs at ~600ms. I interpret this to mean that caching is somehow defeated when facet processing is enabled. The facets come back empty, as expected:

"facet_counts": {
  "facet_queries": {},
  "facet_fields": {},
  "facet_ranges": {},
  "facet_intervals": {},
  "facet_heatmaps": {}
}

If I remove the collapse directive:

/default?indent=on&q=*:*&wt=json&facet=true

qTimes are back down to 0 after the initial query, whether or not faceting is requested. Is this expected behaviour, or am I missing some supporting configuration for proper field collapsing?

Thanks!
Ron
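P.S. For clarity, here's exactly how I'm building the request URLs above (the collection path and groupid field are from my setup; the helper itself is just illustrative):

```python
from urllib.parse import urlencode

def build_query_url(base="http://localhost:8983/solr/mycollection", facet=False):
    """Build the collapse query; the collapse post-filter goes in fq,
    and facet=true is the only facet parameter I add."""
    params = [
        ("indent", "on"),
        ("q", "*:*"),
        ("wt", "json"),
        ("fq", "{!collapse field=groupid}"),
    ]
    if facet:
        params.append(("facet", "true"))
    return base + "/default?" + urlencode(params)

url = build_query_url(facet=True)
```

The slow-every-time behaviour shows up only when both the fq collapse and facet=true are present.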
Solr 4.4, enablePositionIncrements=true and PhraseQueries
Hello,

I'm working on an upgrade from Solr 1.4.1 to 4.4. One of my field analyzers uses StopFilter, which as of 4.4 no longer permits setting enablePositionIncrements to false. As a consequence, some hand-constructed phrase queries (basically generated via calls to SolrPluginUtils.parseQueryStrings on field:value text snippets) now seem to fail relative to 1.4.1, because (I think) of the gaps created in the phrase query content.

By way of example, I have indexed text of the form "Old Ones" and query text of the form "The Old Ones". Debug output shows my phrase query being generated as field:"? Old Ones", and that does not match the indexed source text "Old Ones", presumably since there is no initial token to fill the gap. With position increments set to false (tested by temporarily setting LUCENE_43 in solrconfig.xml to bypass the forced 4.4 restriction), it does what I expect (and what 1.4.1 does): it simply ignores the stop words outright, generating the query field:"Old Ones", which matches my source text.

Is there a way to configure phrase queries to ignore gaps, or otherwise ignore positioning information for missing/removed tokens? Fiddling with slop is not a viable option -- I need exact sequential matching on my token sequences apart from stopword presence. One workaround that occurred to me is adding a position-normalizing filter that resets the term positions to sequential values, but I'm hoping there may be some other configuration option to restore backwards-compatible phrase matching given the neutering of enablePositionIncrements.

Thanks!
Ron
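P.S. To make the failure mode concrete, here's a toy simulation in plain Python (not Lucene code; the stop list and matching logic are deliberately simplified) of how the leading position gap defeats an exact phrase match:

```python
# Toy model of position increments: with keep_gaps, a removed stopword still
# consumes a position (the 4.4 behavior), leaving a hole in the phrase.

STOPWORDS = {"the", "a", "an", "of", "on"}

def analyze(text, keep_gaps=True):
    """Return (token, position) pairs after stopword removal."""
    out, pos = [], 0
    for tok in text.lower().split():
        if tok in STOPWORDS:
            if keep_gaps:
                pos += 1  # positional hole where the stopword was
            continue
        out.append((tok, pos))
        pos += 1
    return out

def phrase_match(doc_tokens, query_tokens):
    """Exact phrase match: every query token must appear at its offset from
    some phrase start s >= 0 in the document."""
    doc = {pos: tok for tok, pos in doc_tokens}
    return any(
        all(doc.get(s + p) == tok for tok, p in query_tokens)
        for s in range(len(doc_tokens) + 1)
    )

indexed = analyze("Old Ones")                      # [('old', 0), ('ones', 1)]
gapped = analyze("The Old Ones", keep_gaps=True)   # [('old', 1), ('ones', 2)]
flat = analyze("The Old Ones", keep_gaps=False)    # [('old', 0), ('ones', 1)]
# With the leading hole, 'old' must sit at offset 1 from the phrase start,
# which no alignment of the indexed "Old Ones" can satisfy.
```

In this simplified model, the gapped query fails against the indexed text while the flattened (1.4.1-style) query matches, which mirrors what I'm seeing in debug output.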
Re: SpellCheckComponent: No file-based suggestions + Location issue
I finally had a chance to get back to this and got the file-based spell checker up and going. I thought I'd close the loop on this thread in case others downstream somehow manage to reproduce my silliness.

Recall the symptom I reported earlier: "I see the n-grams (n=3,4) but the text looks interspersed with spaces." The issue was simply a file-encoding problem: I was (foolishly) editing my dictionary file using WordPad and saving as "Unicode", not realizing that this mapped to UTF-16, hence the extra pad characters.

Thanks for the tips and the nice work on the spell checker!
Ron
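P.S. For anyone curious what went wrong, here's a quick Python illustration of why UTF-16-encoded ASCII text reads as space-interspersed when consumed as a single-byte encoding:

```python
# WordPad's "Unicode" is UTF-16LE: each ASCII character is followed by a
# 0x00 byte, which shows up as a gap when the file is read byte-per-char.
word = "pizza"
utf16_bytes = word.encode("utf-16-le")    # b'p\x00i\x00z\x00z\x00a\x00'
misread = utf16_bytes.decode("latin-1")   # NULs interleaved with the letters
spaced = misread.replace("\x00", " ").rstrip()
# spaced == 'p i z z a' -- exactly the "interspersed with spaces" symptom
```

Saving the dictionary as plain ASCII/UTF-8 made the spurious gaps disappear.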
Re: SpellCheckComponent: No file-based suggestions + Location issue
Shalin:
: The index directory location is being created inside the current working
: directory. We should change that. I've opened SOLR-604 and attached a
: patch which fixes this.

I updated from the nightly build to incorporate your fix and it works perfectly, now building the spell indexes in solr/data. Thanks!

Grant:
: What happens when you open the built index in Luke (http://www.getopt.org/luke)?

Hmm, it looks a bit spacey -- I see the n-grams (n=3,4) but the text looks interspersed with spaces. Perhaps this is an artifact of Luke, or n-grams are supposed to be this way, but that would obviously seem problematic. Here are some snips:

word: h i s t o r y
word: p i z z a
gram3: i z
gram3: i

: Did you see any exceptions in your log?

Just a warning, which I've ignored based on the discussions in SOLR-572:

WARNING: No fieldType: null found for dictionary: external. Using WhitespaceAnalzyer.

Oddly, even if I specify fieldType with a legitimate field type (e.g., spell) from my schema.xml, this same warning is thrown, so I assume the parameter is functionless:

WARNING: No fieldType: spell found for dictionary: external. Using WhitespaceAnalzyer.

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Ron
Re: NullPointerException at lucene.analysis.StopFilter with 1.3
: I'm just looking into transitioning from solr 1.2 to 1.3 (trunk). I
: have some legacy handler code (called AdvancedRequestHandler) that
: used to work with 1.2 but now throws an exception using 1.3 (latest
: nightly build).

This is an interesting use case that wasn't really considered when we switched away from using the SolrCore singleton ... When I have some more time, I'll spin up a thread on solr-dev to discuss what we should do about this -- in the meantime, feel free to file a bug that StopFilter isn't backwards compatible.

Created SOLR-594 for this issue.

FWIW: constructing a new TokenizerChain inside your RequestHandler's handleRequest method seems unnecessary. If nothing else, you could do this in your init method and reuse the TokenizerChain on every request. But if it were me, I'd just use the schema.xml to declare a fieldtype that had the behavior I want, and then use schema.getFieldType(specialType).getQueryAnalyzer().tokenStream(...)

I actually had a single reusable version, but flattened it back out in the code snippet for clarity. But thanks for the tactful suggestion. :-) I didn't know that you could fetch the tokenizer chain directly from the schema (how cool), which was what was originally desired -- the constructed tokenizer was just mirroring an existing field.

I appreciate the tip, Hoss -- much cleaner!
r
sp.dictionary.threshold parm of spell checker seems unresponsive
I'm playing around with the spell checker on the 1.3 nightly build and don't see any effect from changes to sp.dictionary.threshold in terms of dictionary size. A value of 0.0 seems to create a dictionary of the same size and content as a value of 0.9. (I'd expect a very small dictionary in the latter case.) I think sp.dictionary.threshold is a float parameter, but maybe I'm misunderstanding? And just to be sure: I assume I can alter this parameter prior to issuing the rebuild command to build the dictionary -- I don't need to reindex termSourceField between changes?

My solrconfig.xml has this definition for the handler:

<requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="sp.query.suggestionCount">30</int>
    <float name="sp.query.accuracy">0.5</float>
  </lst>
  <str name="sp.dictionary.indexDir">spell</str>
  <str name="termSourceField">dictionary</str>
  <float name="sp.dictionary.threshold">0.9</float>
</requestHandler>

And schema.xml, in case that is somehow relevant:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="dictionary" type="spell" indexed="true" stored="false" multiValued="true" omitNorms="true"/>

Any advice? I'd definitely like to tighten up the dictionary, but it appears to always include terms regardless of their frequency in the source content.

Thanks,
Ron
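P.S. In case it helps clarify the question, here's my mental model of what the threshold should do -- this is an assumption on my part about the parameter's intent, not the actual SpellCheckerRequestHandler code:

```python
# Assumed semantics of sp.dictionary.threshold: keep only terms whose
# document frequency is at least (threshold * numDocs).
def build_dictionary(doc_freqs, num_docs, threshold):
    """doc_freqs maps term -> number of documents containing it."""
    return {t for t, df in doc_freqs.items() if df / num_docs >= threshold}

freqs = {"history": 90, "pizza": 40, "zzyzx": 1}
small = build_dictionary(freqs, 100, 0.9)  # only frequent terms survive
full = build_dictionary(freqs, 100, 0.0)   # everything survives
```

Under that reading, 0.9 should yield a tiny dictionary and 0.0 the whole vocabulary -- yet I see identical dictionaries for both, hence my question.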
Re: Making stop-words optional with DisMax?
: We use two fields, one with and one without stopwords. The exact field
: has a higher boost than the other. That works pretty well.

Thanks for the tip, wunder! We are doing likewise for our pf parm of DisMax, and that part works well -- exact matches are highly relevant, and stopped matches less so but still present in the results set. The main problem is getting past the qf parm such that we don't have invisible titles (stop-words removed by the qf pipeline, leaving an empty query) or over-specified generated queries (where stop-words turn out to be required but can't match for various reasons).

: It helps to have an automated relevance test when tuning the boost (and
: other things). I extracted queries and clicks from the logs for a couple
: of months. Not perfect, but it is hard to argue with 32 million clicks.

I'd say -- a dream data set. :-) Good idea on the relevance test -- eyeballing boost changes seems definitely prone to unexpected effects across all of the queries one didn't think to try. (A dark art, boost tuning...)

Ron
Re: Making stop-words optional with DisMax?
: sure, but what logic would you suggest be used to decide when to make
: them optional? :)

Operationally, I was thinking a tokenizer could use the stop-word list (or an optional-word list) to mark tokens as optional rather than removing them from the token stream. A DisMaxOptional would then generate appropriate queries with the non-optionals as the core and then permute the optionals around those as optional clauses. I say this with no deep understanding of how DisMax does its thing, of course, so feel free to call me naive. As to what words to put in the optionals list, the function words (articles and prepositions) seem to be the ones that folks either omit or confuse, so they'd be good candidates.

: start by hitting Solr using a qf with fields that contain stop words. if
: you get 0 hits, then query with a qf that contains all fields that don't
: have stop words in them (but you can leave them in pf).

I think I've so internalized the list's advice *not* to generate multiple queries that that didn't readily occur to me. :-) One problem, I suppose, is that the first query might return some results but not the desired one (perhaps there is a title "On Men and Mice"), and so I never get to the second query ("mice men" once stopped) that would get me "Of Mice and Men". But an improvement in cases where no results come back from an overspecified query, I'd agree.

The other thought I've had is to do some query analysis up front, prior to submission -- if the query is all stops, send it to a separate handler that doesn't do stop-word removal in the qf specification; otherwise, if any non-stop-word exists, send it to a handler with a qf that does remove stops, and rely on the pf component to boost up exact matches. I hate that the analysis step would probably duplicate the tokenization done by Solr, but it might be worth it. There'd still be some problematic queries, but this may be as close as it'll get.

Thanks for the suggestions, Hoss!
Ron
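P.S. A sketch of the up-front routing idea in Python, just to pin down what I mean (the handler names are made up, and the real version would need to share Solr's tokenization and stop list):

```python
# Route all-stop-word queries to a handler whose qf keeps stop words;
# everything else goes to the normal stopped handler.
STOPWORDS = {"the", "a", "an", "of", "it", "and", "on"}

def pick_handler(query):
    tokens = query.lower().split()
    if tokens and all(t in STOPWORDS for t in tokens):
        return "/select-keepstops"  # hypothetical: qf does not strip stops
    return "/select"                # normal handler: qf strips stops
```

So "it" and "the the" would route to the keep-stops handler, while "the sound of music" would go through the normal stopped qf with pf boosting exact matches.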
Making stop-words optional with DisMax?
I've followed the stop-word discussion with some interest, but I've yet to find a solution that completely satisfies our needs. I was wondering if anyone could suggest some other options to try, short of a custom handler or building our own queries (DisMax does such a fine job generally!).

We are using DisMax and indexing media titles (books, music). We want our queries to be sensitive to stop-words, but not so sensitive that we fail to match on missing or incorrect stop-words. For example, here are a set of queries and the desired behavior:

* "it" - matches "It" by Stephen King (high relevance) and other titles with "it" therein, e.g. "Some Like It Hot" (lower relevance)
* "the the" - matches music by The The; other titles with "the" therein at lower relevance are fine
* "the sound of music" - matches "The Sound of Music" with high relevance
* "a sound of music" - still matches "The Sound of Music"; lower relevance is fine
* "the doors" - matches music by The Doors, even though it is indexed just as "Doors" (our data supplier drops the definite article)
* "the life" - matches titles "The Life" with high relevance, and titles of just "Life" with lower relevance

Basically, we want direct matches (including stop-words) to be highly relevant -- we use the phrase query mechanism for that -- but we also want matches if the user mis-remembers the correct (stopped) prepositions or inserts a few irrelevant stop-words (like articles). We see this in the wild with non-trivial frequency: the wrong choice of preposition ("on mice and men"), or an article used that our data supplier didn't include in the original version ("doors").

One thing we tried is to include both a stopped version and a non-stopped version of the title in the qf field, in the hope that this would retrieve all titles without stop-words and still allow us to include pure stop-word queries ("it").
However, DisMax constructs queries such that mixing stopped and non-stopped fields doesn't work as one might hope, as described well here: http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a2461

Since qf controls the initial set of results retrieved for DisMax, and we don't want to use a pure stopped set of fields there (because we won't match on "it" as a query) nor a pure non-stopped set (won't get results for "a sound of music"), we'd seem to be out of luck unless we can figure out a way to augment the qf coverage.

We've tried relaxing query term requirements to allow a missing word or two in the query via mm, but recall is amped up too much, since non-stop-words tend to be dropped and you get a lot of results that match primarily just across stop-words. We've also considered creating a sort of equivalence class for all stop-words (defining synonyms to map stops to some special token), which would allow mis-remembered stop-words to be conflated, but then something like "it" would match anything that contained any stop-word -- again, too high on the recall.

What I think we want is something like an optional-stop-word DisMax that would mark stops as optional and construct queries such that stop-words aren't passed into fields that apply stop-word removal in query clauses (if that makes sense). Has anyone done anything similar, or found a better way to handle stops that exhibits the desired behavior?

Thanks in advance for any thoughts! And, being new to Solr, apologies if I'm confused in my reasoning somewhere.

Ron
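P.S. To pin down what I mean by marking stops as optional, here's a small sketch in Lucene-ish boolean syntax -- entirely hypothetical, nothing like this exists in DisMax today:

```python
# Non-stop tokens become required (+) clauses; stop tokens become optional
# should-clauses that boost when present but don't kill the match when wrong.
STOPWORDS = {"the", "a", "an", "of", "on", "and"}

def optional_stop_query(text, field="title"):
    clauses = []
    for tok in text.lower().split():
        required = tok not in STOPWORDS
        clauses.append(("+" if required else "") + field + ":" + tok)
    return " ".join(clauses)

q = optional_stop_query("a sound of music")
# q == 'title:a +title:sound title:of +title:music' -- "a" and "of" may be
# wrong or missing without preventing a match on "The Sound of Music".
```

An all-stop query like "the the" would produce only optional clauses, so that case would still need special handling (or routing to an unstopped handler).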
Re: Making stop-words optional with DisMax?
Hi Otis,

: I skimmed your email. You are indexing book and music titles. Those tend
: to be short. Do you really benefit from removing stop words in the first
: place? I'd try keeping all the stop words and seeing if that has any
: negative side-effects in your context.

Thanks for your skim and response! We do keep all stop-words -- as you say, that makes sense, since we aren't dealing with long free-text fields and because some titles are pure stops. The negative side-effects lie in stop-words being treated with the same importance as non-stop-words for matching purposes. This manifests in two ways:

1. Users occasionally get the stop-words wrong -- say, the wrong choice of preposition -- which torpedoes the query, since some of the query terms aren't present in the target. For example, "on mice and men" may return nothing (no match for "on") even though it is equivalent to "of mice and men" in a stopped sense.

2. Our original indexed data doesn't always have leading articles and such. For example, we index "Doors", since that is our sourced data, but frequently get queried for "The Doors".

Articles and prepositions (the stuff of good stop-lists) seem to me to be in a fuzzier class -- use 'em if you have 'em during matching, but don't kill your queries because of them. Hence some desire to make them in some way optional during matching.

Ron
Simple sorting questions
Pardon the basicness of these questions, but I'm just getting started with Solr and have a couple of confusions regarding sorting that I couldn't resolve based on the docs or an archive search.

1. There appear to be (at least) two ways to specify sorting, one involving an append to the q parm and the other using the sort parm. Are these exactly equivalent?

http://localhost/solr/select/?q=martha;author+asc
http://localhost/solr/select/?q=martha&sort=author+asc

2. The docs say that sorting can only be applied to non-multivalued fields. Does this mean that sorting won't work *at all* for multi-valued fields, or only that the behaviour is indeterminate? Based on a brief test, sorting on a multi-valued field appeared to work by picking an arbitrary value when multiple values are present and using that for the sort. I wanted to confirm that the expected behaviour is indeed to sort on something (with no guarantees as to what), as opposed to, say, dropping the record, putting records with multiple values at the end with the missing-value records, or something else entirely.

Thanks!
Ron