Hi Aman,

The admin UI screenshot you linked to is from an older version of Solr - what 
version are you using?

Lots of extraneous angle brackets and asterisks got into your email, and I had 
to do a bunch of cleanup before I could read or edit it.  In the future, 
please put your code somewhere people can easily read it and copy/paste it into 
an editor: a GitHub gist, a paste service, etc.

Looks to me like your use of “exhausted” is unnecessary, and is likely the 
cause of the problem you saw (only one document getting processed): you never 
set exhausted to false, and when the filter got reused, it incorrectly carried 
state from the previous document.
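
To make that reuse problem concrete, here's a tiny self-contained stand-in (no 
Lucene dependency; the class and method names are illustrative, not from your 
filter): a filter whose "exhausted" flag is never cleared emits nothing for the 
second stream, while clearing per-stream state in a reset()-style method fixes 
it.  This is just a sketch of the state bug, not a drop-in replacement:

```java
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for a stateful concatenating filter: once it has
// emitted the concatenated token it sets `exhausted` and stops.
class ConcatFilter {
    private final StringBuilder sb = new StringBuilder();
    private boolean exhausted = false;
    private Iterator<String> input;

    void setInput(List<String> tokens) { input = tokens.iterator(); }

    // Returns the concatenated token once per stream, then null.
    String next() {
        if (exhausted) return null;
        while (input.hasNext()) sb.append(input.next());
        exhausted = true;
        String result = sb.toString();
        sb.setLength(0);
        return result;
    }

    // The missing piece in the original filter: clear per-stream state
    // so the filter can be reused for the next document.
    void reset() {
        exhausted = false;
        sb.setLength(0);
    }
}

public class Demo {
    public static void main(String[] args) {
        ConcatFilter f = new ConcatFilter();
        f.setInput(List.of("foo", "bar"));
        System.out.println(f.next());   // foobar

        // Reused for a second document WITHOUT reset(): nothing comes out.
        f.setInput(List.of("baz"));
        System.out.println(f.next());   // null

        // After reset(), the second document is processed correctly.
        f.reset();
        f.setInput(List.of("baz"));
        System.out.println(f.next());   // baz
    }
}
```

In the real filter, the equivalent fix is overriding TokenFilter's reset(), 
calling super.reset(), and clearing the exhausted flag and the StringBuilder 
there.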

Here’s a simpler version that’s hopefully more correct and more efficient (2 
fewer copies from the StringBuilder to the final token).  Note: I didn’t test 
it:

    https://gist.github.com/sarowe/9b9a52b683869ced3a17
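
For what it's worth, the extra copies likely come from stringBuilder.toString() 
(builder to String) followed by toCharArray() (String to char[]) before the 
final copyBuffer() into the term.  A Lucene-free sketch of the single-copy 
alternative via StringBuilder.getChars (illustrative only; the gist may do it 
differently):

```java
public class CopyDemo {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("concatenated");

        // Original pattern: two intermediate copies on the way to the buffer.
        char[] viaString = sb.toString()      // copy 1: builder -> String
                             .toCharArray(); // copy 2: String -> char[]

        // Alternative: write the builder's chars into a buffer directly.
        char[] direct = new char[sb.length()];
        sb.getChars(0, sb.length(), direct, 0); // one copy, no String created

        System.out.println(new String(viaString).equals(new String(direct))); // true
    }
}
```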

Steve
www.lucidworks.com

> On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
> 
> Please help: what am I doing wrong here? Please guide me.
> 
> With Regards
> Aman Tandon
> 
> On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com>
> wrote:
> 
>> Hi,
>> 
>> I created a token concat filter to concatenate all the tokens from the
>> token stream. It creates the concatenated token as expected.
>> 
>> But when I post an XML file containing more than 30,000 documents, only
>> the first document gets the data for that field.
>> 
>> Schema:
>> 
>> <field name="titlex" type="text" indexed="true" stored="false"
>>        required="false" omitNorms="false" multiValued="false" />
>> 
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
>>             outputUnigrams="true" tokenSeparator=""/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>             protected="protwords.txt"/>
>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="stemmed_synonyms_text_prime_ex_index.txt"
>>             ignoreCase="true" expand="true"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords_text_prime_search.txt"
>>             enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>             protected="protwords.txt"/>
>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>> 
>> Please help me. The code for the filter is below; please take a look.
>> 
>> Here is a picture of what the filter is doing:
>> <http://i.imgur.com/THCsYtG.png?1>
>> 
>> The code of the concat filter is:
>> 
>> package com.xyz.analysis.concat;
>> 
>> import java.io.IOException;
>> 
>> import org.apache.lucene.analysis.TokenFilter;
>> import org.apache.lucene.analysis.TokenStream;
>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>> 
>> public class ConcatenateWordsFilter extends TokenFilter {
>> 
>>   private CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
>>   private OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
>>   PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
>>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
>> 
>>   private StringBuilder stringBuilder = new StringBuilder();
>>   private boolean exhausted = false;
>> 
>>   /**
>>    * Creates a new ConcatenateWordsFilter
>>    * @param input TokenStream that will be filtered
>>    */
>>   public ConcatenateWordsFilter(TokenStream input) {
>>     super(input);
>>   }
>> 
>>   /**
>>    * {@inheritDoc}
>>    */
>>   @Override
>>   public final boolean incrementToken() throws IOException {
>>     while (!exhausted && input.incrementToken()) {
>>       char terms[] = charTermAttribute.buffer();
>>       int termLength = charTermAttribute.length();
>>       if (typeAtrr.type().equals("<ALPHANUM>")) {
>>         stringBuilder.append(terms, 0, termLength);
>>       }
>>       charTermAttribute.copyBuffer(terms, 0, termLength);
>>       return true;
>>     }
>> 
>>     if (!exhausted) {
>>       exhausted = true;
>>       String sb = stringBuilder.toString();
>>       System.err.println("The Data got is " + sb);
>>       int sbLength = sb.length();
>>       //posIncr.setPositionIncrement(0);
>>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
>>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
>>           offsetAttribute.startOffset() + sbLength);
>>       stringBuilder.setLength(0);
>>       //typeAtrr.setType("CONCATENATED");
>>       return true;
>>     }
>>     return false;
>>   }
>> }
>> 
>> With Regards
>> Aman Tandon
>> 
