: My initial (half-baked?) thinking is that we need the ability to name
: TokenStreams (Tokenizers and TokenFilters) so that we could do something like:
....
: Thus, each of the named filters create a TeeTokenFilter and have an associated
: SinkTokenizer. Then, I can declare another analyzer that looks like:
I don't think it's that half baked ... but I'm not sure why it would need
to be that implicit. I think it would make a lot of sense to have a
TeeTokenFilterFactory that takes the name of a "tee" to write to (ie: no
implicit creation of TeeTokenFilters between every existing Factory) ...
the question is then what to do with those "tees"...
This seems very analogous to the way copyField works ... let the user
specify that anytime something which went into a field named "foo" (or
matching a pattern of "foo*") and comes out of a tee named "bar" it should
be sent to a field named "baz" ... where "baz" must have a fieldtype that
uses the SinkTokenizer (or perhaps the SinkTokenizer can at least be
implicit, since we'll want to do error checking that you don't attempt to
"tee" to a field that has some other Analyzer or TokenizerFactory)...
<fieldType name="text" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.TeeFilterFactory" tee="text_nostop" />
<filter class="solr.WordDelimiterFilterFactory" .. />
<filter class="solr.TeeFilterFactory" tee="text" />
</analyzer>
...
</fieldType>
<fieldtype name="caseInsensitive" class="solr.TextField">
<analyzer type="index">
<!-- no tokenizer, not an error, see below -->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- must still have a tokenizer here -->
...
</fieldtype>
<fieldtype name="properNouns" class="solr.TextField">
<analyzer type="index">
<!-- no tokenizer, not an error, see below -->
<filter class="my.IgnoreAllImproperNounsTokenFilter" />
</analyzer>
<analyzer type="query">
<!-- must still have a tokenizer here -->
...
</fieldtype>
<field name="body" type="text" />
<!-- since these fields use types whose index analyzer has no
tokenizer, each must appear in a teeField declaration (or error at
startup), and you cannot index to them directly (error when adding doc)
-->
<field name="bodyCaseInsensitive" type="caseInsensitive" />
<field name="nounsInBody" type="properNouns" />
<teeField fromField="body" fromTee="text" toField="bodyCaseInsensitive" />
<teeField fromField="body" fromTee="text_nostop" toField="nounsInBody" />
...hmmmm, except ideally you'd want to be able to string together an
arbitrary number of "pipelines" to make a nice big mesh graph of
interconnected analysis, and this would only let you do two ... Ah! except
that fields don't have to be stored or indexed. So the "toField" of a
<teeField/> could exist purely to point at some bits of an analysis
pipeline and then be the "fromField" of other <teeField/> rules.
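
A rough sketch of what that chaining might look like, reusing the made-up syntax from above (teeField, fromTee and solr.TeeFilterFactory are all still hypothetical, nothing here exists in Solr today). The "lowerJunction" and "stemmed" types, the "bodyLower" and "bodyLowerStemmed" fields, and the "lowered" tee are names invented purely for illustration:

```xml
<!-- hypothetical junction type: no tokenizer, tokens arrive from a tee -->
<fieldtype name="lowerJunction" class="solr.TextField">
  <analyzer type="index">
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TeeFilterFactory" tee="lowered" />
  </analyzer>
</fieldtype>

<!-- hypothetical end-of-chain type, fed only by the junction above -->
<fieldtype name="stemmed" class="solr.TextField">
  <analyzer type="index">
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldtype>

<!-- neither indexed nor stored: exists only to connect two teeField rules -->
<field name="bodyLower" type="lowerJunction" indexed="false" stored="false" />
<field name="bodyLowerStemmed" type="stemmed" />

<teeField fromField="body" fromTee="text_nostop" toField="bodyLower" />
<teeField fromField="bodyLower" fromTee="lowered" toField="bodyLowerStemmed" />
```

Query analyzers are left out of the sketch; presumably the junction type wouldn't need one since nothing ever searches it, but that's a guess.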
-Hoss