Field collapsing, facets, and qtime: caching issue?

2017-02-10 Thread Ronald K. Braun
I'm experimenting with field collapsing in SolrCloud 6.2.1 and have this
set of request parameters against a collection:

/default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}

My default handler is just defaults:

<requestHandler name="/default" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
The first query runs about 600ms, then subsequent repeats of the same query
are 0-5ms for qTime, which I interpret to mean that the query is cached
after the first hit.  All as expected.
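(I assume this is the queryResultCache doing its job; the relevant
solrconfig.xml entry would be something like the following -- sizes are
illustrative, not my exact values:)

<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>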

However, if I enable facets without actually requesting a facet:

/default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}&facet=true

then every submission of the query runs at ~600ms.  I interpret this to
mean that caching is somehow defeated when facet processing is set.  Facets
are empty as expected:

"facet_counts": {
  "facet_queries": { },
  "facet_fields": { },
  "facet_ranges": { },
  "facet_intervals": { },
  "facet_heatmaps": { }
}

If I remove the collapse directive:

/default?indent=on&q=*:*&wt=json&facet=true

qTimes are back down to 0 after the initial query whether or not faceting
is requested.
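One way I've been double-checking this (assuming the standard admin mbeans
endpoint; the collection name is a placeholder) is to watch the
queryResultCache hit counts:

http://localhost:8983/solr/mycollection/admin/mbeans?stats=true&cat=CACHE&wt=json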

Is this expected behaviour or am I missing some supporting configuration
for proper field collapsing?

Thanks!

Ron


Solr 4.4, enablePositionIncrements=true and PhraseQueries

2013-08-21 Thread Ronald K. Braun
Hello,

I'm working on an upgrade from Solr 1.4.1 to 4.4.  One of my field
analyzers uses StopFilterFactory, which as of 4.4 no longer permits
setting enablePositionIncrements to false.  As a consequence, some
hand-constructed phrase queries (basically generated via calls to
SolrPluginUtils.parseQueryStrings on field:value text snippets) now fail
relative to 1.4.1, I believe because of the gaps created in the phrase
query content.
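For reference, the filter declaration in question is roughly this (the
stopwords file name is illustrative); with luceneMatchVersion 4.4, Solr
now rejects the enablePositionIncrements="false" attribute outright:

<filter class="solr.StopFilterFactory" words="stopwords.txt"
        enablePositionIncrements="false"/>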

By way of example, I have indexed text of the form "Old Ones" and query
text of the form "The Old Ones".  Debug output shows my phrase query being
generated as field:"? Old Ones", and that fails to match the indexed
source text "Old Ones", presumably since there is no initial token to fill
the gap.

With enablePositionIncrements set to false (tested by temporarily setting
luceneMatchVersion to LUCENE_43 in solrconfig.xml) to bypass the forced
4.4 restriction, it does what I expect (and what 1.4.1 does): the stop
words are simply ignored, and the generated query field:"Old Ones" matches
my source text.

Is there a way to configure phrase queries to ignore gaps, or otherwise
ignore positioning information for missing/removed tokens?  Fiddling with
slop is not a viable option -- I need exact sequential matching on my
token sequences apart from stopword presence.  One workaround that
occurred to me was adding a position-normalizer filter that resets term
positions to be sequential (rough sketch below), but I'm hoping there is
some other configuration option to restore backwards-compatible phrase
matching given the neutering of enablePositionIncrements.
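To make that concrete, here is a minimal, untested sketch of the
normalizer filter I have in mind (class name invented; Lucene 4.x
TokenFilter API):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Collapses the position gaps StopFilter leaves behind so the
// surviving tokens look adjacent to PhraseQuery again.
public final class FlattenPositionsFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncr =
      addAttribute(PositionIncrementAttribute.class);

  public FlattenPositionsFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // An increment > 1 marks one or more removed tokens; flatten it to 1.
    if (posIncr.getPositionIncrement() > 1) {
      posIncr.setPositionIncrement(1);
    }
    return true;
  }
}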

Thanks!

Ron


Re: SpellCheckComponent: No file-based suggestions + Location issue

2008-07-04 Thread Ronald K. Braun
I finally had a chance to get back to this and got the file-based
spell checker up and going.  I thought I'd close the loop on this
thread in case others downstream somehow managed to reproduce my
silliness.

> I see the n-grams (n=3,4) but the text looks interspersed with spaces.

The issue was simply a file-encoding problem: I was (foolishly)
editing my dictionary file using WordPad and saving as Unicode, not
realizing that this mapped to UTF-16, hence the extra pad characters.
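(If anyone else hits this: re-saving the dictionary as UTF-8 -- e.g. with
something along the lines of iconv -f UTF-16 -t UTF-8 dictionary.txt >
dictionary.utf8.txt, the file names being placeholders -- sorted it out.)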

Thanks for the tips and the nice work on the spell checker!

Ron


Re: SpellCheckComponent: No file-based suggestions + Location issue

2008-06-24 Thread Ronald K. Braun
Shalin:

> The index directory location is being created inside the current working
> directory. We should change that. I've opened SOLR-604 and attached a
> patch which fixes this.

I updated from nightly build to incorporate your fix and it works
perfectly, now building the spell indexes in solr/data.  Thanks!

Grant:

> What happens when you open the built index in Luke
> (http://www.getopt.org/luke)?

Hmm, it looks a bit spacey -- I see the n-grams (n=3,4) but the text
looks interspersed with spaces.  Perhaps this is an artifact of Luke
or n-grams are supposed to be this way, but that would obviously seem
problematic.  Here are some snips:

word  h i s t o r y 
word  p i z z a 
gram3 i z
gram3  i 

> Did you see any exceptions in your log?

Just a warning which I've ignored based on the discussions in SOLR-572:

WARNING: No fieldType: null found for dictionary: external.  Using
WhitespaceAnalzyer.

Oddly, even if I specify fieldType with a legitimate field type (e.g.,
"spell") from my schema.xml, this same warning is thrown, so I assume the
parameter is nonfunctional.

WARNING: No fieldType: spell found for dictionary: external.  Using
WhitespaceAnalzyer.

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Ron


Re: NullPointerException at lucene.analysis.StopFilter with 1.3

2008-06-09 Thread Ronald K. Braun
> : I'm just looking into transitioning from solr 1.2 to 1.3 (trunk).  I
> : have some legacy handler code (called AdvancedRequestHandler) that
> : used to work with 1.2 but now throws an exception using 1.3 (latest
> : nightly build).
>
> This is an interesting use case that wasn't really considered when we
> switched away from using the SolrCore singleton ...
> When I have some more time, I'll spin up a thread on solr-dev to discuss
> what we should do about this -- in the meantime feel free to file a bug
> that StopFilter isn't backwards compatible.

Created SOLR-594 for this issue.

> FWIW: constructing a new TokenizerChain inside your RequestHandler's
> handleRequest method seems unnecessary.  If nothing else, you could do
> this in your init method and reuse the TokenizerChain on every request.
> But if it were me, I'd just use the schema.xml to declare a fieldtype
> that had the behavior I want, and then use
> schema.getFieldType("specialType").getQueryAnalyzer().tokenStream(...)

I actually had a single reusable version, but flattened it back out in
the code snippet for clarity.  But thanks for the tactful suggestion.
:-)  I didn't know that you could fetch the tokenizer chain directly
from the schema (how cool), which was what was originally desired --
the constructed tokenizer was just mirroring an existing field.  I
appreciate the tip, Hoss -- much cleaner!
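For the archives, the reworked lookup ends up looking something like this
("myfield" is a placeholder for a field declared with the special type;
Solr 1.3-era API):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.request.SolrQueryRequest;

// Reuse the analysis chain declared in schema.xml rather than
// hand-building a TokenizerChain in the handler.
private TokenStream tokenize(SolrQueryRequest req, String text) {
  return req.getSchema()
      .getFieldType("myfield")        // FieldType of that field
      .getQueryAnalyzer()
      .tokenStream("myfield", new StringReader(text));
}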

r


sp.dictionary.threshold parm of spell checker seems unresponsive

2008-06-03 Thread Ronald K. Braun
I'm playing around with the spell checker on 1.3 nightly build and
don't see any effect on changes to the sp.dictionary.threshold in
terms of dictionary size.  A value of 0.0 seems to create a dictionary
of the same size and content as a value of 0.9.  (I'd expect a very
small dictionary in the latter case.)  I think sp.dictionary.threshold
is a float parameter, but maybe I'm misunderstanding?

And just to be sure, I assume I can alter this parameter prior to
issuing the rebuild command to build the dictionary -- I don't need to
reindex termSourceField between changes?
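(For concreteness, by "rebuild command" I mean hitting the handler with
something like
http://localhost:8983/solr/select?q=anything&qt=spellchecker&cmd=rebuild
-- host and port being whatever yours are.)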

My solrconfig.xml has this definition for the handler:

<requestHandler name="spellchecker"
                class="solr.SpellCheckerRequestHandler" startup="lazy">
  <lst name="defaults">
    <int name="sp.query.suggestionCount">30</int>
    <float name="sp.query.accuracy">0.5</float>
  </lst>
  <str name="sp.dictionary.indexDir">spell</str>
  <str name="termSourceField">dictionary</str>
  <float name="sp.dictionary.threshold">0.9</float>
</requestHandler>

And schema.xml in case that is somehow relevant:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="dictionary" type="spell" indexed="true" stored="false"
       multiValued="true" omitNorms="true"/>

Any advice?  I'd definitely like to tighten up the dictionary but it
appears to always include terms regardless of their frequency in the
source content.

Thanks,

Ron


Re: Making stop-words optional with DisMax?

2008-03-27 Thread Ronald K. Braun
> We use two fields, one with and one without stopwords. The exact
> field has a higher boost than the other. That works pretty well.

Thanks for the tip, wunder!  We are doing likewise for our pf parm of
DisMax, and that part works well -- exact matches are highly relevant
and stopped matches less so, but still present in the result set.  The
main problem is getting past the qf parm such that we don't have
invisible titles (stop-words removed by the qf pipeline, leaving an
empty query) or over-specified generated queries (where stop-words
turn out to be required but can't match for various reasons).
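Our setup is roughly the following (field names and boosts are
illustrative, not our production values):

<requestHandler name="titlesearch" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">title_stopped</str>
    <str name="pf">title_exact^4.0 title_stopped</str>
  </lst>
</requestHandler>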

> It helps to have an automated relevance test when tuning the boost
> (and other things). I extracted queries and clicks from the logs
> for a couple of months. Not perfect, but it is hard to argue with
> 32 million clicks.

I'd say -- a dream data set.  :-)  Good idea on the relevance test --
eyeballing boost changes seems definitely prone to unexpected effects
across all of the queries one didn't think to try.  (A dark art, boost
tuning...)

Ron


Re: Making stop-words optional with DisMax?

2008-03-27 Thread Ronald K. Braun
> sure, but what logic would you suggest be used to decide when to make
> them optional?  :)

Operationally, I was thinking a tokenizer could use the stop-word list
(or an optional-word list) to mark tokens as optional rather than
removing them from the token stream.  DisMaxOptional would then
generate appropriate queries with the non-optionals as the core and
then permute the optionals around those as optional clauses.  I say
this with no deep understanding of how DisMax does its thing, of
course, so feel free to call me naive.

As to what words to put in the optionals list, the function words
(articles and prepositions) seem to be the ones that folks either omit
or confuse, so they'd be good candidates.

> Start by hitting Solr using a qf with fields that contain stop words.  If
> you get 0 hits, then query with a qf that contains all fields that don't
> have stop words in them (but you can leave them in pf).

I think I've so internalized the list's advice *not* to generate multiple
queries that this didn't readily occur to me.  :-)  One problem, I
suppose, is that the first query might return some results but not the
desired one (perhaps there is a title "On Men and Mice"), so I'd never
get to the second query ("mice men", once stopped) that would find "Of
Mice and Men".  But I'd agree it's an improvement in cases where no
results come back from an overspecified query.
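In code, that fallback would be something like this (a SolrJ sketch; the
handler names are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Two-pass fallback: try the stop-word-preserving handler first, and
// retry against the stopped handler only on zero hits.
QueryResponse searchWithFallback(SolrServer server, String userInput)
    throws Exception {
  SolrQuery q = new SolrQuery(userInput);
  q.set("qt", "dismax-with-stops");   // qf fields keep stop-words
  QueryResponse rsp = server.query(q);
  if (rsp.getResults().getNumFound() == 0) {
    q.set("qt", "dismax-stopped");    // qf fields strip stop-words
    rsp = server.query(q);
  }
  return rsp;
}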

The other thought I've had is to do some query analysis up front, prior
to submission -- if the query is all stops, send it to a separate handler
that doesn't do stop-word removal in the qf specification; otherwise, if
any non-stop-word exists, send it to a handler with a qf that does remove
stops and rely on the pf component to boost up exact matches.  I dislike
the analysis step, which would probably duplicate the tokenization done
by Solr, but it might be worth it.  There'd still be some problematic
queries, but this may be as close as it'll get.

Thanks for the suggestions, Hoss!

Ron


Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs.  I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).

We are using DisMax, and indexing media titles (books, music).  We
want our queries to be sensitive to stop-words, but not so sensitive
that we fail to match on missing or incorrect stop-words.  For
example, here is a set of queries and the desired behavior:

* "it" - matches "It" by Stephen King (high relevance) and other titles
with "it" therein, e.g. "Some Like It Hot" (lower relevance)
* "the the" - matches music by The The; other titles with "the" therein
at lower relevance are fine
* "the sound of music" - matches "The Sound of Music" with high relevance
* "a sound of music" - still matches "The Sound of Music"; lower
relevance is fine
* "the doors" - matches music by The Doors, even though it is indexed
just as "Doors" (our data supplier drops the definite article)
* "the life" - matches titles "The Life" with high relevance, and matches
titles of just "Life" with lower relevance

Basically, we want direct matches (including stop-words) to be highly
relevant, and we use the phrase query mechanism for that, but we also
want matches if the user mis-remembers the correct (stopped)
prepositions or inserts a few irrelevant stop-words (like articles).
We see this in the wild with non-trivial frequency -- the wrong choice
of preposition ("on mice and men") or an article used that our data
supplier didn't include in the original version ("the doors").

One thing we tried is to include both a stopped version and a
non-stopped version of the title in the qf field, in the hope that
this would retrieve all titles without stop-words and still allow us
to handle pure stop-word queries ("it").  However, DisMax constructs
queries such that mixing stopped and non-stopped fields doesn't work
as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a2461

Since qf controls the initial set of results retrieved for DisMax, and
we don't want to use a pure stopped set of fields there (because we
won't match "it" as a query) nor a pure non-stopped set (we won't get
results for "a sound of music"), we'd seem to be out of luck unless we
can figure out a way to augment the qf coverage.

We've tried relaxing query term requirements to allow a missing word
or two in the query via mm, but recall is amped up too much since
non-stop-words tend to be dropped and you get a lot of results that
match primarily just across stop-words.

We've also considered creating a sort of equivalence class for all
stop-words (defining synonyms to map stops to some special token),
which would allow mis-remembered stop-words to be conflated, but then
something like "it" would match anything that contained any stop-word
-- again, too high on the recall.
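(Concretely, I was imagining something like this in a synonyms.txt fed
through SynonymFilterFactory; the placeholder token is invented:)

a => _stop_
an => _stop_
the => _stop_
of => _stop_
on => _stop_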

What I think we want is something like an optional stop-word DisMax
that would mark stops as optional and construct queries such that
stop-words aren't passed into fields that apply stop-word removal in
query clauses (if that makes sense).  Has anyone done anything similar
or found a better way to handle stops that exhibits the desired
behavior?

Thanks in advance for any thoughts!  And, being new to Solr, apologies
if I'm confused in my reasoning somewhere.

Ron


Re: Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
Hi Otis,

> I skimmed your email.  You are indexing book and music titles.  Those
> tend to be short.  Do you really benefit from removing stop words in
> the first place?  I'd try keeping all the stop words and seeing if that
> has any negative side-effects in your context.

Thanks for your skim and response!  We do keep all stop-words -- as
you say, makes sense since we aren't dealing with long free text
fields and because some titles are pure stops.

The negative side-effects lie in stop-words being treated with the
same importance as non-stop-words for matching purposes.  This
manifests in two ways.  1. Users occasionally get the stop-words wrong
-- say, the wrong choice of preposition -- which torpedoes the query
since some of the query terms aren't present in the target.  For
example, "on mice and men" may return nothing (no match for "on") even
though it is equivalent to "of mice and men" in a stopped sense.
2. Our original indexed data doesn't always have leading articles and
such.  For example, we index "Doors", since that is our sourced data,
but frequently get queried for "The Doors".  Articles and prepositions
(the stuff of good stop-lists) seem to me to be in a fuzzier class --
use 'em if you have 'em during matching, but don't kill your queries
because of them.  Hence some desire to make them in some way
"optional" during matching.

Ron


Simple sorting questions

2007-11-07 Thread Ronald K. Braun
Pardon the basicness of these questions, but I'm just getting started
with Solr and have a couple of confusions regarding sorting that I
couldn't resolve based on the docs or an archive search.

1. There appear to be (at least) two ways to specify sorting, one
involving an append to the q parm and the other using the sort parm.
Are these exactly equivalent?

   http://localhost/solr/select/?q=martha;author+asc
   http://localhost/solr/select/?q=martha&sort=author+asc

2. The docs say that sorting can only be applied to non-multivalued
fields.  Does this mean that sorting won't work *at all* for
multi-valued fields, or only that the behaviour is indeterminate?
Based on a brief test, sorting on a multi-valued field appeared to work
by picking an arbitrary value when multiple values are present and
using that for the sort.  I wanted to confirm that the expected
behaviour is indeed to sort on something (with no guarantees as to
what), as opposed to, say, dropping the record, putting records with
multiple values at the end alongside records missing the field, or
something else entirely.

Thanks!

Ron