Hello Elisabeth,

Wouldn't it be simpler to have a custom component in the front end of
your search server that transforms a query like <<hotel de ville paris>>
into <<"hotel de ville" paris>> (i.e. turns each occurrence of the
sequence "hotel de ville" into a phrase query)?
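A minimal sketch of that front-end rewrite, assuming a plain list of protected phrases (the class and method names here are illustrative, not part of any Solr API):

```java
import java.util.List;
import java.util.regex.Pattern;

// Pre-processor that wraps each protected phrase in double quotes so
// the downstream query parser treats it as a phrase query.
public class PhraseWrapper {
    private final List<String> protectedPhrases;

    public PhraseWrapper(List<String> protectedPhrases) {
        this.protectedPhrases = protectedPhrases;
    }

    public String rewrite(String query) {
        String result = query;
        for (String phrase : protectedPhrases) {
            // Naive case-insensitive replacement; a real implementation
            // would respect token boundaries and already-quoted phrases.
            result = result.replaceAll("(?i)" + Pattern.quote(phrase),
                                       "\"" + phrase + "\"");
        }
        return result;
    }
}
```

For example, `new PhraseWrapper(List.of("hotel de ville")).rewrite("hotel de ville paris")` yields `"hotel de ville" paris`.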

Concerning protected words inside the tokenizer, I don't think that is
currently possible.
The main reason is that the QueryParser breaks the query on each space
before passing each query part through the analysis chain of every
searched field. Hence all the smart things you do at indexing time to
wrap a sequence of tokens into a single one cannot be reproduced at
query time.

Please someone correct me if I'm wrong!

Alternatively, I think you could do this with a custom query parser (so
that whole phrases are sent to the analyzers instead of single words).
But since tokenizers don't support protected-word lists, you would also
need a custom token filter that consumes the token stream and annotates
the tokens matching an entry in the protection list.
Unfortunately, if your protected list is long, you will run into
performance issues unless you rely on a dedicated data structure such as
a trie (e.g. a Patricia trie). You can find solid implementations on the
Internet (see https://github.com/rkapsi/patricia-trie).

Then you could make your filter consume a "sliding window" of tokens for
as long as the window matches a prefix in your trie.
Once you have a complete match in your trie, the filter can set an
attribute of the type of your choice (e.g. MyCustomKeywordAttribute) on
the first matching token and store the complete match in it (e.g.
"hotel de ville").
If you don't have a complete match, emit the unmatched tokens
unmodified.
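The sliding-window idea can be sketched with a simple word-level trie, outside of any Lucene TokenFilter (the class and method names are illustrative assumptions, not Lucene API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Word-level trie over protected phrases, with greedy longest-match
// grouping of a token list (the "sliding window" described above).
public class PhraseTrie {
    private final Map<String, PhraseTrie> children = new HashMap<>();
    private boolean terminal; // true if a protected phrase ends here

    public void add(String phrase) {
        PhraseTrie node = this;
        for (String word : phrase.toLowerCase().split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new PhraseTrie());
        }
        node.terminal = true;
    }

    // The longest run of tokens matching a protected phrase is emitted
    // as one token; unmatched tokens pass through unmodified.
    public List<String> group(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            PhraseTrie node = this;
            int matchEnd = -1;
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j).toLowerCase());
                if (node == null) break;
                if (node.terminal) matchEnd = j; // complete phrase so far
            }
            if (matchEnd >= 0) {
                out.add(String.join(" ", tokens.subList(i, matchEnd + 1)));
                i = matchEnd + 1;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With "hotel de ville" added to the trie, grouping ["hotel", "de", "ville", "paris"] yields ["hotel de ville", "paris"]. In a real TokenFilter you would buffer tokens instead of working on a list, but the matching logic is the same.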

I hope this helps...

--
Tanguy


2012/5/22 elisabeth benoit <elisaelisael...@gmail.com>

> Hello,
>
> Does someone know if there is a way to configure a tokenizer to split on
> white spaces, all words excluding a bunch of expressions listed in a file?
>
> For instance, if I want "hotel de ville" not to be split in words, a
> request like "hotel de ville paris" would be split into two tokens:
>
> "hotel de ville" and "paris" instead of 4 tokens
>
> "hotel"
> "de"
> "ville"
> "paris"
>
> I imagine something like
>
> <tokenizer class="solr.StandardTokenizerFactory"
> protected="protoexpressions.txt"/>
>
> Thanks a lot,
> Elisabeth
>
