[jira] Commented: (SOLR-314) Store Analyzed token text from an incoming SolrInputDocument

J.J. Larrea (JIRA) Sun, 22 Jul 2007 19:04:52 -0700

    [ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514541 ]


J.J. Larrea commented on SOLR-314:
----------------------------------

I agree that a stored-field pre-processor would be quite useful, but I'm not 
sure the proposed scheme is the best way to define and control it. In 
particular, using f.<field>.analysis=<fieldType> to pull the analyzer 
definition out of a different fieldType seems like a fragile and hacky 
construct. It also blurs what I see as separate concerns: (1) making 
pre-storage processing part of how a field is handled, versus (2) dynamically 
changing the handling of a field. Another valid concern you raise, (3) how to 
handle duplicate indexed values, should apply whether the duplicates arose 
from tokenization or from separate <field>...</field> values.

I wonder if a more robust implementation of the pre-processing concern would 
simply be to add another analyzer type, "store", to the current set of "index" 
and "query" that can be defined on a fieldType; naturally it wouldn't be in 
the default set.

For your example, 

  <fieldType name="text_ws" class="solr.TextField" >
      <analyzer type="store,index,query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
  </fieldType>

would ws-tokenize "aaa bbb ccc" and store 3 separate strings.

You raise the question of how to control the catenation of tokens.  It would 
be simple enough to create an UnTokenize token filter that can be added to the 
tail of any analyzer chain.  It could take arguments for the separator strings 
to use depending on whether tokens are overlapping or not, or better yet, 
printf formats for both cases.
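
To make the idea concrete, here is a minimal sketch in plain Python (not Solr 
or Lucene code) of what such an UnTokenize step might do. It assumes each 
token carries a position increment in the Lucene style, where 0 means the 
token overlaps the previous one; the function name and the "adjacent" and 
"overlap" parameter names are only illustrative.

```python
from typing import Iterable, List, Tuple

# A token is modelled as (text, position_increment). This loosely mirrors
# Lucene's convention: 1 = next position, 0 = same position as the previous
# token (an "overlapping" token such as a synonym). Assumption for this sketch.
Token = Tuple[str, int]


def untokenize(tokens: Iterable[Token],
               adjacent: str = " ",
               overlap: str = "/") -> str:
    """Join token texts, choosing the separator by whether tokens overlap."""
    parts: List[str] = []
    for i, (text, pos_inc) in enumerate(tokens):
        if i > 0:
            # No separator before the first token; afterwards pick one
            # based on the position increment of the current token.
            parts.append(overlap if pos_inc == 0 else adjacent)
        parts.append(text)
    return "".join(parts)


tokens = [("quick", 1), ("fast", 0), ("dog", 1), ("canine", 0), ("jumped", 1)]
print(untokenize(tokens, adjacent=" ", overlap=" / "))
```

A printf-format variant would replace the two separator strings with two 
format strings; the separator form is shown here only because it is shorter.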

That would extend the store analyzer to quite different use-cases. For 
example, semicolon-delimited author strings could be split, with each author 
run through your CapitalizationFilter for storage, while for indexing the 
punctuation would be stripped and the text lower-cased:

        <fieldType name="text_ws" class="solr.TextField" >
                <analyzer type="store">
                        <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s+"/>
                        <filter class="solr.CapitalizationFilterFactory"
                                onlyFirstWord="false"
                                keep="and or the is my of for de"
                                okPrefix="McK"
                                forceFirstLetter="true" />
                        <filter class="solr.UnTokenizerFilterFactory" adjacent="; "/>
                </analyzer>
                <analyzer type="index,query">  <!-- type="index,query" is optional -->
                        <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;|\s]+"/>
                       ...
                        <filter class="solr.LowerCaseFilterFactory"/>
                </analyzer>
        </fieldType>
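
As a rough illustration of how the stored and indexed forms would differ under 
that configuration, the following plain-Python sketch models the two chains. 
It is not Solr code: the capitalization logic is only an approximation of the 
options shown (keep words, okPrefix, forced first letter), the elided index 
filters are omitted, and the function names and sample authors are made up.

```python
import re
from typing import List

KEEP = {"and", "or", "the", "is", "my", "of", "for", "de"}  # from keep="..."
OK_PREFIXES = ("McK",)                                      # from okPrefix="McK"


def capitalize_author(author: str) -> str:
    """Approximate the capitalization step for one author token."""
    out: List[str] = []
    for i, word in enumerate(author.split()):
        if word.startswith(OK_PREFIXES):
            out.append(word)                  # leave e.g. "McKinley" untouched
        elif i > 0 and word.lower() in KEEP:
            out.append(word.lower())          # keep words stay lower-case
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


def store_value(raw: str) -> str:
    """'store' chain: split on ';\\s+', capitalize, rejoin with '; '."""
    authors = [a for a in re.split(r";\s+", raw.strip()) if a]
    return "; ".join(capitalize_author(a) for a in authors)


def index_tokens(raw: str) -> List[str]:
    """'index' chain: split on '[,;|\\s]+' and lower-case each token."""
    return [t.lower() for t in re.split(r"[,;|\s]+", raw) if t]


raw = "ryan McKinley; jean de la fontaine; ANNE SMITH"
print(store_value(raw))
print(index_tokens(raw))
```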

In a similar example, stored values could be run through the 
HyphenatedWordsFilterFactory (and then untokenized) so they reflect what is 
actually being indexed.

One could even store the result of analysis (perhaps in a CopyField) as a 
visual token mapping to help diagnose indexing/analysis problems, concatenated 
with something on the order of

  <filter class="solr.UnTokenizerFilterFactory" adjacent=" " overlap=" / " missing="&lt;null&gt;" />

e.g. "<null> quick / fast dog / canine jumped ..."
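
A sketch of how such a diagnostic string could be assembled, again in plain 
Python rather than Solr code. It assumes Lucene-style position increments, 
where an increment greater than 1 means earlier positions were dropped (for 
instance by a stop filter); the function name and the sample token stream are 
hypothetical.

```python
from typing import Iterable, List, Tuple


def token_map(tokens: Iterable[Tuple[str, int]],
              adjacent: str = " ",
              overlap: str = " / ",
              missing: str = "<null>") -> str:
    """Render (text, position_increment) tokens as a visual token mapping."""
    positions: List[List[str]] = []   # one list of texts per position
    for text, pos_inc in tokens:
        if pos_inc == 0 and positions:
            positions[-1].append(text)        # overlaps the previous token
            continue
        # An increment of n > 1 implies n - 1 skipped (missing) positions.
        for _ in range(max(pos_inc, 1) - 1):
            positions.append([])
        positions.append([text])
    return adjacent.join(overlap.join(p) if p else missing for p in positions)


# Hypothetical analysis of "the quick dog jumped": the stop word is dropped
# (so "quick" arrives with increment 2) and synonyms are injected.
stream = [("quick", 2), ("fast", 0), ("dog", 1), ("canine", 0), ("jumped", 1)]
print(token_map(stream))
```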

Then, to address the other concern (2) of allowing user control of field 
types, one solution would be to recast the StoreAnalysisProcessor as, say, a 
DynamicFieldTypeProcessor, allowing f.<field>.type=<fieldType> when it is 
inserted in the chain, e.g. for language-specific analysis.
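
For illustration only, the parameter handling for such a processor might look 
roughly like the plain-Python sketch below. It is not the UpdateRequestProcessor 
API; the function names and the field type names "text_en" and "text_fr" are 
assumptions made up for the example.

```python
import re
from typing import Dict, Mapping

# Matches request parameters of the form f.<field>.type
FIELD_TYPE_PARAM = re.compile(r"^f\.(?P<field>.+)\.type$")


def field_type_overrides(params: Mapping[str, str]) -> Dict[str, str]:
    """Collect per-request field type overrides from f.<field>.type params."""
    overrides: Dict[str, str] = {}
    for name, value in params.items():
        match = FIELD_TYPE_PARAM.match(name)
        if match:
            overrides[match.group("field")] = value
    return overrides


def resolve_type(field: str,
                 schema_types: Mapping[str, str],
                 overrides: Mapping[str, str]) -> str:
    """Use the request override if present, else the schema's field type."""
    return overrides.get(field, schema_types[field])


schema_types = {"title": "text_en", "id": "string"}      # hypothetical schema
params = {"f.title.type": "text_fr", "commit": "true"}   # hypothetical request
overrides = field_type_overrides(params)
print(resolve_type("title", schema_types, overrides))
print(resolve_type("id", schema_types, overrides))
```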

(It's late, I hope this all makes sense...)



> Store Analyzed token text from an incoming SolrInputDocument
> ------------------------------------------------------------
>
>                 Key: SOLR-314
>                 URL: https://issues.apache.org/jira/browse/SOLR-314
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Ryan McKinley
>         Attachments: SOLR-314-StoreAnalysis.patch
>
>
> This is an UpdateRequestProcessor that runs incoming fields through a Field 
> Analyzer and stores the output of each token as a field value.
> For Example.  If you have a field type defined:
>   <fieldType name="text_ws" class="solr.TextField" >
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>   </fieldType>
> And send a request:
> /update?store.analysis=true&f.feature.analysis=text_ws
> <add> <doc>
>  <field name="feature">aaa bbb ccc</field>
> </doc></add>
> The returned document will look like:
> <doc>
>  <arr name="feature">
>   <str>aaa</str>
>   <str>bbb</str>
>   <str>ccc</str>
>  </arr>
> </doc>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
