Re: TeeTokenFilter and SinkTokenizer

Chris Hostetter Sat, 05 Jan 2008 18:03:20 -0800

: My initial (half-baked?) thinking is that we need the ability to name
: TokenStreams (Tokenizers and TokenFilters) so that we could do something like:
        ....
: Thus, each of the named filters create a TeeTokenFilter and have an associated
: SinkTokenizer.  Then, I can declare another analyzer that looks like:


I don't think it's that half baked ... but i'm not sure why it would need 
to be that implicit.  i think it would make a lot of sense to have a 
TeeTokenFilterFactory that takes the name of a "tee" to write to (ie: no 
implicit creation of TeeTokenFilters between every existing Factory) ... 
the question is then what to do with those "tees"...

This seems very analogous to the way copyField works ... let the user 
specify that anytime something which went into a field named "foo" (or 
matching a pattern of "foo*") and comes out of a tee named "bar" it should 
be sent to a field named "baz" ... where "baz" must have a fieldtype that 
uses the SinkTokenizer (or perhaps the SinkTokenizer can be implicit at 
least, since we'll want to do error checking that you don't attempt to 
"tee" to a field that has some other Analyzer or TokenizerFactory)...

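(for comparison, a plain copyField rule today is just a source and a dest, 
e.g. with the placeholder names from above:

    <copyField source="foo*" dest="baz" />

...so the tee version only needs to add the name of the tee as a third 
piece of information.)
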
    <fieldType name="text" class="solr.TextField" >
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.TeeFilterFactory" tee="text_nostop" />
        <filter class="solr.WordDelimiterFilterFactory" .. />
        <filter class="solr.TeeFilterFactory" tee="text" />
      </analyzer>
      ...
    </fieldType>
    <fieldtype name="caseInsensitive" class="solr.TextField"> 
      <analyzer type="index">
        <!-- no tokenizer, not an error, see below -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!-- must still have a tokenizer here -->
        ...
      </analyzer>
    </fieldtype>
    <fieldtype name="properNouns" class="solr.TextField"> 
      <analyzer type="index">
        <!-- no tokenizer, not an error, see below -->
        <filter class="my.IgnoreAllImproperNounsTokenFilter" />
      </analyzer>
      <analyzer type="query">
        <!-- must still have a tokenizer here -->
        ...
      </analyzer>
    </fieldtype>
   <field name="body" type="text" />
   <!-- since these fields use types whose index analyzer has no 
        tokenizer, each must appear in a teeField declaration (or error 
        at startup), and you cannot index to them directly (error when 
        adding doc)
   -->
   <field name="bodyCaseInsensitive" type="caseInsensitive" />
   <field name="nounsInBody"         type="properNouns" />

   <teeField fromField="body" fromTee="text" toField="bodyCaseInsensitive" />
   <teeField fromField="body" fromTee="text_nostop" toField="nounsInBody"  />


...hmmmm, except ideally you'd want to be able to string together an 
arbitrary number of "pipelines" to make a nice big mesh graph of 
interconnected analysis, and this would only let you do two ... Ah! except 
that fields don't have to be stored or indexed.  so the "toField" of a 
<teeField/> could exist purely to point at some bits of an analysis 
pipeline and then be the "fromField" of other <teeField/> rules.
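
a rough, untested sketch of that chaining, using the same hypothetical 
teeField syntax as above ("nounsWithTee", "bodyNouns", "nounsCaseInsensitive" 
and the tee name "nouns" are made-up names, purely for illustration):

    <fieldtype name="nounsWithTee" class="solr.TextField">
      <analyzer type="index">
        <!-- no tokenizer: this type is only ever fed from a tee -->
        <filter class="my.IgnoreAllImproperNounsTokenFilter" />
        <filter class="solr.TeeFilterFactory" tee="nouns" />
      </analyzer>
      <analyzer type="query">
        <!-- must still have a tokenizer here -->
        ...
      </analyzer>
    </fieldtype>

    <!-- intermediate link only: neither indexed nor stored -->
    <field name="bodyNouns" type="nounsWithTee" indexed="false" stored="false" />
    <field name="nounsCaseInsensitive" type="caseInsensitive" />

    <teeField fromField="body"      fromTee="text_nostop" toField="bodyNouns" />
    <teeField fromField="bodyNouns" fromTee="nouns"       toField="nounsCaseInsensitive" />

here "bodyNouns" is never searched or returned; it just hangs a second 
pipeline off the "text_nostop" tee and exposes its own tee for the next rule.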


-Hoss
