Re: Applying Tokenizers and Filters to CopyFields

Martin Wunderlich Wed, 25 Mar 2015 14:15:04 -0700

Thanks a lot, Ahmet. I’ve just read up on the qf (query fields) parameter and it 
sounds good. Since the field contents are currently all identical, I can’t 
really test it yet.
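
As a rough, untested sketch of what that might look like in solrconfig.xml: 
this assumes the edismax query parser and the three field names from the 
schema quoted below. The handler name "/select" and the choice of default 
fields are assumptions for illustration, not something taken from my actual 
setup.

```xml
<!-- Sketch only: handler name and field list are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the extended DisMax parser so that qf is honoured -->
    <str name="defType">edismax</str>
    <!-- Fields searched when the request does not name one explicitly -->
    <str name="qf">original stopwords_removed expanded</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request, for example 
q=Sprache&defType=edismax&qf=expanded, to restrict the search to a single 
field.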


Cheers, 

Martin
 



> On 25.03.2015 at 21:27, Ahmet Arslan <iori...@yahoo.com.INVALID> wrote:
> 
> Hi Martin,
> 
> fq means filter query. Maybe you want to use the qf (query fields) parameter of 
> edismax?
> 
> 
> 
> On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich <martin...@gmx.net> 
> wrote:
> Hi all, 
> 
> I am wondering what the process is for applying Tokenizers and Filters (as 
> defined in the FieldType definition) to field contents that result from 
> CopyFields. To be more specific, in my Solr instance, I would like to support 
> query expansion by two means: removing stop words and adding inflected word 
> forms as synonyms. 
> 
> To use a specific example, let’s say I have the following sentence to be 
> indexed (from a Wittgenstein manuscript): 
> 
> "Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken."
> 
> 
> This sentence will be indexed in a field called "original" that is defined as 
> follows: 
> 
> <field name="original" type="text_original" indexed="true" stored="true" 
> required="true"/>
> 
>    <fieldType name="text_original" class="solr.TextField" 
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> Then, for the two types of query expansion, I have set up two dedicated 
> fields: 
> 
> - one field where stopwords are removed both from the indexed content and from 
> the query. So, if the user is searching for a phrase like "der Sprache", Solr 
> should still find the segment above, because the determiners ("der" and 
> "die") are removed prior to indexing and prior to querying, respectively. 
> This field is defined as follows: 
> 
> <field name="stopwords_removed" type="text_stopwords_removed" indexed="true" 
> stored="true" required="true"/>
> 
>    <fieldType name="text_stopwords_removed" class="solr.TextField" 
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> - a second field where synonyms are added to the query so that more segments 
> will be found. For instance, if the user is searching for the plural form 
> "Sprachen", Solr should return the segment above, due to this entry in the 
> synonyms file: "Sprache,Sprach,Sprachen". This field is defined as follows: 
> 
> <field name="expanded" type="text_expanded" indexed="true" stored="true" 
> required="true"/>
> 
>    <fieldType name="text_expanded" class="solr.TextField" 
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords_de.txt" format="snowball"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" 
> ignoreCase="true" expand="true"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> Finally, to avoid having to specify three fields with identical content in 
> the import documents, I am defining the two fields for query expansion as 
> copyFields: 
> 
>  <copyField source="original" dest="stopwords_removed"/>
>  <copyField source="original" dest="expanded"/>
> 
> Now, my expectation would be as follows: 
> - during import, two temporary fields are created by copying content from the 
> original field
> - these two temporary fields are then pre-processed as per the definitions 
> above
> - the pre-processed version of the text is added to the index
> - then, the user can search for "Sprache", "sprache", "Sprachen" or "der 
> Sprache" and will always get the segment above as a matching result. 
> 
> However, what actually happens is that I get matches only for "Sprache" and 
> "sprache". 
> 
> The other thing that strikes me as odd is that when I restrict the search to 
> one of the fields only, using the "fq" parameter, I get no results. For 
> instance: 
> http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true
> 
> will return no matches. I would have expected that, with the fq parameter, the 
> user can specify what type of search (s)he would like to carry out: a standard 
> search (field original) or an expanded search (one of the other two fields). 
> 
> For debugging, I have checked the analysis and the results seem OK (posted 
> below). 
> Apologies for the long post, but I am really a bit stuck here (even after 
> doing a lot of reading and googling). It is probably something simple that I 
> am missing. 
> Thanks a lot in advance for any help. 
> 
> Cheers, 
> 
> Martin
> 
> 
> ST:  Was zum Wesen der Welt gehört kann die Sprache nicht ausdrücken
> SF:  Was zum Wesen Welt gehört kann die Sprache nicht ausdrücken
> LCF: was zum wesen welt gehört kann die sprache nicht ausdrücken
