Aman, Solr reuses the same token filter instances over and over, calling reset() before sending each document through. Your code sets "exhausted" to true and then never sets it back to false, so the next time the token filter instance is used, its "exhausted" value is still true, and no input stream tokens are ever concatenated again.
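Here's a minimal, Lucene-free sketch of the problem (ConcatDemo is a made-up name, not your actual filter class) showing why a reused instance has to clear its per-document state in reset():

```java
// Minimal sketch (no Lucene dependency; ConcatDemo is a hypothetical class)
// of why a filter instance that is reused across documents must clear its
// per-document state before each reuse.
class ConcatDemo {
    private final StringBuilder sb = new StringBuilder();
    private boolean exhausted = false;

    // Analogous to TokenFilter.reset(): called before each reuse.
    void reset() {
        exhausted = false;   // forgetting this line reproduces your bug
        sb.setLength(0);
    }

    // Concatenates one "document's" tokens; works only once per reset.
    String consume(String[] tokens) {
        if (exhausted) {
            return "";       // stale state: nothing is concatenated
        }
        for (String t : tokens) {
            sb.append(t);
        }
        exhausted = true;
        return sb.toString();
    }
}
```

So in your filter, the fix is to override reset(), call super.reset() in it, and there set exhausted back to false and clear the StringBuilder.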
Does that make sense?

Steve
www.lucidworks.com

> On Jun 19, 2015, at 1:10 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>
> Hi Steve,
>
>> you never set exhausted to false, and when the filter got reused, it
>> incorrectly carried state from the previous document.
>
> Thanks for replying, but I am not able to understand this.
>
> With Regards
> Aman Tandon
>
> On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe <sar...@gmail.com> wrote:
>
>> Hi Aman,
>>
>> The admin UI screenshot you linked to is from an older version of Solr -
>> what version are you using?
>>
>> Lots of extraneous angle brackets and asterisks got into your email and
>> made for a bunch of cleanup work before I could read or edit it. In the
>> future, please put your code somewhere people can easily read it and
>> copy/paste it into an editor: a github gist, a paste service, etc.
>>
>> Looks to me like your use of "exhausted" is unnecessary, and is likely
>> the cause of the problem you saw (only one document getting processed):
>> you never set exhausted to false, and when the filter got reused, it
>> incorrectly carried state from the previous document.
>>
>> Here's a simpler version that's hopefully more correct and more
>> efficient (2 fewer copies from the StringBuilder to the final token).
>> Note: I didn't test it:
>>
>> https://gist.github.com/sarowe/9b9a52b683869ced3a17
>>
>> Steve
>> www.lucidworks.com
>>
>>> On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>
>>> Please help, what am I doing wrong here? Please guide me.
>>>
>>> With Regards
>>> Aman Tandon
>>>
>>> On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I created a token concat filter to concatenate all the tokens from the
>>>> token stream. It creates the concatenated token as expected.
>>>>
>>>> But when I am posting an xml containing more than 30,000 documents,
>>>> only the first document has the data for that field.
>>>>
>>>> Schema:
>>>>
>>>> <field name="titlex" type="text" indexed="true" stored="false"
>>>>        required="false" omitNorms="false" multiValued="false" />
>>>>
>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>             generateNumberParts="1" catenateWords="0" catenateNumbers="1"
>>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
>>>>             outputUnigrams="true" tokenSeparator=""/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory"
>>>>             synonyms="stemmed_synonyms_text_prime_ex_index.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>             words="stopwords_text_prime_search.txt"
>>>>             enablePositionIncrements="true"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> Please help me. The code for the filter is as follows; please take a look.
>>>>
>>>> Here is a picture of what the filter is doing:
>>>> <http://i.imgur.com/THCsYtG.png?1>
>>>>
>>>> The code of the concat filter is:
>>>>
>>>> package com.xyz.analysis.concat;
>>>>
>>>> import java.io.IOException;
>>>>
>>>> import org.apache.lucene.analysis.TokenFilter;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>>>>
>>>> public class ConcatenateWordsFilter extends TokenFilter {
>>>>
>>>>   private CharTermAttribute charTermAttribute =
>>>>       addAttribute(CharTermAttribute.class);
>>>>   private OffsetAttribute offsetAttribute =
>>>>       addAttribute(OffsetAttribute.class);
>>>>   PositionIncrementAttribute posIncr =
>>>>       addAttribute(PositionIncrementAttribute.class);
>>>>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
>>>>
>>>>   private StringBuilder stringBuilder = new StringBuilder();
>>>>   private boolean exhausted = false;
>>>>
>>>>   /**
>>>>    * Creates a new ConcatenateWordsFilter
>>>>    * @param input TokenStream that will be filtered
>>>>    */
>>>>   public ConcatenateWordsFilter(TokenStream input) {
>>>>     super(input);
>>>>   }
>>>>
>>>>   /**
>>>>    * {@inheritDoc}
>>>>    */
>>>>   @Override
>>>>   public final boolean incrementToken() throws IOException {
>>>>     while (!exhausted && input.incrementToken()) {
>>>>       char terms[] = charTermAttribute.buffer();
>>>>       int termLength = charTermAttribute.length();
>>>>       if (typeAtrr.type().equals("<ALPHANUM>")) {
>>>>         stringBuilder.append(terms, 0, termLength);
>>>>       }
>>>>       charTermAttribute.copyBuffer(terms, 0, termLength);
>>>>       return true;
>>>>     }
>>>>
>>>>     if (!exhausted) {
>>>>       exhausted = true;
>>>>       String sb = stringBuilder.toString();
>>>>       System.err.println("The Data got is " + sb);
>>>>       int sbLength = sb.length();
>>>>       //posIncr.setPositionIncrement(0);
>>>>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
>>>>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
>>>>           offsetAttribute.startOffset() + sbLength);
>>>>       stringBuilder.setLength(0);
>>>>       //typeAtrr.setType("CONCATENATED");
>>>>       return true;
>>>>     }
>>>>     return false;
>>>>   }
>>>> }
>>>>
>>>> With Regards
>>>> Aman Tandon