Re: Query with exact number of tokens

Erick Erickson Fri, 21 Sep 2018 07:46:54 -0700

A variant on Alexandre's approach:
1) At index time, count the tokens that will be produced yourself. This
may be a little tricky; you shouldn't have WordDelimiterFilterFactory
in your analysis chain, for instance, since it can split one input word
into several tokens.
2) Put the number of tokens in a separate field.
3) At query time, search with
q=+company_name:(+century +bancorp +inc) +tokens_in_company_name_field:3
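
As a rough sketch of how a client might assemble that query string: this
assumes the tokens were already produced the same way the analysis chain
produces them, and that they are plain alphanumeric (no query-syntax
escaping is done). The function name build_exact_token_query is made up
for illustration; the field names are the ones from the example query.

```python
def build_exact_token_query(tokens,
                            name_field="company_name",
                            count_field="tokens_in_company_name_field"):
    """Require every token in name_field plus an exact token count.

    Assumption: tokens are already analyzed (lowercased, punctuation
    stripped), so nothing here needs escaping.
    """
    if not tokens:
        raise ValueError("need at least one token")
    # Prefix each token with '+' so every one of them is mandatory.
    required = " ".join("+" + tok for tok in tokens)
    return "+{}:({}) +{}:{}".format(
        name_field, required, count_field, len(tokens)
    )


print(build_exact_token_query(["century", "bancorp", "inc"]))
```

The count clause is what rejects "NEW CENTURY BANCORP, INC.": all three
required tokens are present there, but its stored count would be 4.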


You don't need phrase queries with this approach, since order doesn't matter.

It can get tricky, though: should "CENTURY BANCORP, INC." and "CENTURY
BANCORP, INCORPORATED." match?

Again, though, this means your indexing code has to do the same thing
as your analysis chain, which isn't very hard if the analysis chain is
simple. I might use a char filter (PatternReplaceCharFilterFactory, for
example) to remove all characters that are neither alphanumeric nor
whitespace, then a whitespace tokenizer and (probably) a lowercase
filter. That's pretty easy to replicate in order to count tokens.
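
A minimal sketch of replicating such a chain in indexing code, assuming
the schema really does just those three steps (strip punctuation, split
on whitespace, lowercase). The regex and the names analyze and
count_tokens are illustrative, not anything Solr provides; if the real
chain differs, the counts will drift apart.

```python
import re

# Assumption: mirrors a pattern-replace char filter that deletes every
# character that is not an ASCII letter, digit, or whitespace.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]")


def analyze(name):
    """Return the tokens the assumed chain would emit for a company name."""
    cleaned = _NON_ALNUM.sub("", name)   # char filter step
    tokens = cleaned.split()             # whitespace tokenizer step
    return [tok.lower() for tok in tokens]  # lowercase filter step


def count_tokens(name):
    """Value to store in the separate token-count field."""
    return len(analyze(name))


for name in ("CENTURY BANCORP, INC.", "NEW CENTURY BANCORP, INC."):
    print(name, "->", analyze(name), count_tokens(name))
```

Note that punctuation is deleted rather than replaced with a space here,
so something like "A.B" becomes the single token "ab". Whichever choice
you make, the schema and the indexing code have to agree on it.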

Best,
Erick
On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
>
> I think you can match everything in the query to the field using either
> 1) disMax/eDisMax with mm=100%
> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
> 2) Complex Phrase Query Parser with inOrder=false:
> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>
> The number of tokens though is hard. You only know what your tokens
> are at the end of the indexing pipeline. And during search, the tokens
> are looked up from their indexes and only then the documents are
> looked up.
>
> You may be able to do this with a custom PostFilter that would run after
> everything else to just reject records with extra tokens. That would
> not be too expensive.
>
> Or (possibly simpler way) you could try to precalculate things, by
> writing a custom TokenFilter that takes a stream and returns token
> count to be used as a copyField target. Then you send your query to
> the same field with any full-query preserving syntax, either as a
> phrase or as a field query parser:
> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>
> I would love to know if any/all of this works for you.
>
> Regards,
>    Alex.
>
> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
> > Hi,
> >
> > I have to search for company names where my first requirement is to find
> > only exact matches on the company name.
> >
> > For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
> > CENTURY BANCORP, INC."
> > because the result company has the extra keyword "NEW".
> >
> > I can't use exact match because the sequence of tokens may differ. Basically
> > I need to find results where the tokens are the same in any order and the
> > number of tokens matches.
> >
> > I have no idea whether it's possible to include the number of tokens in the
> > query and have a Solr field that holds that count to match against.
> >
> > Thanks for your help
> > Sergio
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
