RE: FW: Difference Between Tokenizer and filter

Vanlerberghe, Luc Thu, 03 Mar 2016 06:04:51 -0800

The "index" type analyzer is used when documents are indexed and determines 
what tokens end up in the index.
The "query" type analyzer is used to analyze the user query and determines what 
tokens will be searched for.

As an example: if you want to be able to match on synonyms, you could have a 
"query" type analyzer that replaces each token in the user's query with the 
list of corresponding synonyms. The "index" type analyzer should then just 
index the tokens as they are.
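
A rough sketch of that setup could look like the following. The field type 
name "text_syn" and the file name "synonyms.txt" are made up for illustration, 
and the choice of tokenizer and lowercase filter is only an assumption:

```xml
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: store the tokens as they are (lowercased only). -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: expand each query token to its list of synonyms. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Newer Solr releases offer solr.SynonymGraphFilterFactory as the replacement 
for solr.SynonymFilterFactory, so check which one your version provides.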

(If you have a fixed list of synonyms, both analyzers could instead map each 
token to a predefined 'canonical' synonym, which saves both index and query 
time.)
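
A sketch of that canonical-synonym variant, with a hypothetical field type 
name and synonyms file: a single analyzer element without a type attribute is 
used for both indexing and querying, and expand="false" maps every entry on a 
comma-separated synonyms.txt line to the first entry on that line.

```xml
<fieldType name="text_canonical" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain runs at index time and query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand="false": "tv, television, telly" all become "tv". -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

Because both sides reduce to the same canonical token, changing synonyms.txt 
in this variant generally means reindexing.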

Luc

-----Original Message-----
From: G, Rajesh [mailto:r...@cebglobal.com] 
Sent: Thursday, 3 March 2016 14:51
To: solr-user@lucene.apache.org
Subject: RE: FW: Difference Between Tokenizer and filter

Hi Shawn,

One last question on the analyzer. If the format of the index on disk is not 
controlled by the tokenizer, or anything else in the analysis chain, then what 
do type="index" and type="query" in the analyzer mean? Can you please help me 
understand?

        <analyzer type="index">

        </analyzer>
        <analyzer type="query">

        </analyzer>

-----Original Message-----
From: G, Rajesh
Sent: Thursday, March 3, 2016 6:12 PM
To: 'solr-user@lucene.apache.org' <solr-user@lucene.apache.org>
Subject: RE: FW: Difference Between Tokenizer and filter

Thanks Shawn. This helps.

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, March 2, 2016 11:04 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter

On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email, Koji. Can you please explain the role of the 
> tokenizer and the filter, so I can understand why I should not have two 
> tokenizers in index and why I should have at least one tokenizer in query?

You can't have two tokenizers.  It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a 
Tokenizer operates on an input that's a single string, turning it into a token 
stream, and a Filter uses a token stream for both input and output.  A 
CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is composed 
of zero or more CharFilter entries, exactly one Tokenizer entry, and zero or 
more Filter entries.  Alternatively, you can specify a single Analyzer class 
in place of the chain.  An Analyzer is a lot like a Tokenizer: it is 
effectively the same thing as a tokenizer combined with filters.
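
For illustration, the Analyzer-class form might look like this. The field 
type name "text_ws" is hypothetical, and WhitespaceAnalyzer is just one Lucene 
analyzer picked as an example:

```xml
<fieldType name="text_ws" class="solr.TextField">
  <!-- One Analyzer class replaces the charFilter/tokenizer/filter entries. -->
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
```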

CharFilters run before the Tokenizer, and Filters run after the Tokenizer.  
CharFilters, Tokenizers, Filters, and Analyzers are Lucene concepts.
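
A minimal sketch of that ordering follows. The field type name "text_html" 
and the file "stopwords.txt" are assumptions, and the specific factories were 
chosen only as familiar examples:

```xml
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- 1. CharFilter: string in, string out (strips HTML markup). -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. Tokenizer: string in, token stream out. Exactly one. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 3. Filters: token stream in, token stream out, in listed order. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```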

> My understanding is that the tokenizer is used to say how the content should
> be indexed physically in the file system. Filters are used on query results.

The format of the index on disk is not controlled by the tokenizer, or anything 
else in the analysis chain.  It is controlled by the Lucene codec.  Only a very 
small part of the codec is configurable in Solr, but normally this does not 
need configuring.  The codec defaults are appropriate for the majority of use 
cases.
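
If you ever do need to touch it, the configurable part looks roughly like the 
sketch below. The field type name is hypothetical, and the SimpleText postings 
format is used only because it is human-readable; it is not meant for 
production use:

```xml
<!-- solrconfig.xml: let the schema choose per-field formats. -->
<codecFactory class="solr.SchemaCodecFactory"/>

<!-- Schema file: a field type that overrides the postings format. -->
<fieldType name="string_simpletext" class="solr.StrField"
           postingsFormat="SimpleText"/>
```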

Thanks,
Shawn
