[ https://issues.apache.org/jira/browse/SOLR-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514541 ]
J.J. Larrea commented on SOLR-314:
----------------------------------
I agree that a stored-field pre-processor would be quite useful, but I'm not
sure the proposed scheme is the best way to define and control it. In
particular, using f.<field>.analysis=<fieldType> to pull the analyzer definition
out of a different fieldType seems like a fragile and hacky construct. It also
blurs what I see as separate concerns: (1) making pre-storage processing part
of how a field is handled, versus (2) dynamically changing the handling of a
field. Another valid concern you raise, (3), is how to handle duplicate indexed
values, but that should apply whether the duplicates arose from tokenization or
from separate <field>...</field> values.
I wonder if a more robust implementation of the pre-processing concern would
simply be to add another analyzer type, "store", alongside the current "index"
and "query" types that can be defined on a fieldType; naturally it wouldn't be
in the default set.
For your example,
<fieldType name="text_ws" class="solr.TextField">
  <analyzer type="store,index,query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
would ws-tokenize "aaa bbb ccc" and store 3 separate strings.
You raise the question of how to control the catenation of tokens. It would be
simple enough to create an UnTokenize token filter that can be added to the
tail of any analyzer chain. It could take arguments for the separator strings
to use depending on whether tokens overlap or not, or better yet, printf
formats for both cases.
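
As a rough illustration of the UnTokenize idea, here is a sketch in plain Java
with no Solr or Lucene dependency. The Token class, the join method, and the
"missing" placeholder handling are hypothetical names and assumptions made for
this example, not an existing API. It uses plain separator strings rather than
printf formats, and it treats a position increment of 0 as "overlapping".

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a stand-alone model of the proposed UnTokenize filter.
final class UnTokenizeSketch {

    // Hypothetical stand-in for an analyzed token: its text plus a position
    // increment (0 = overlaps the previous token, 1 = adjacent, >1 = gap).
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Joins tokens back into one string, choosing the separator from the
    // position increment and emitting a placeholder for each skipped position.
    static String join(List<Token> tokens, String adjacent, String overlap, String missing) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (Token t : tokens) {
            for (int gap = 1; gap < t.positionIncrement; gap++) {
                if (!first) {
                    out.append(adjacent);
                }
                out.append(missing);
                first = false;
            }
            if (!first) {
                out.append(t.positionIncrement == 0 ? overlap : adjacent);
            }
            out.append(t.text);
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("quick", 2),   // one position (e.g. a removed stopword) was skipped
                new Token("fast", 0),    // synonym at the same position as "quick"
                new Token("dog", 1),
                new Token("canine", 0),
                new Token("jumped", 1));
        System.out.println(join(tokens, " ", " / ", "<null>"));
    }
}
```

With adjacent=" ", overlap=" / " and missing="<null>", the main method prints
"<null> quick / fast dog / canine jumped".
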
Such a filter would extend the store analyzer to quite different use cases.
For example, semicolon-delimited author strings could be split, with each
author run through your CapitalizationFilter for storage, while for indexing,
punctuation would be stripped and the text lower-cased:
<fieldType name="text_ws" class="solr.TextField">
  <analyzer type="store">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";\s+"/>
    <filter class="solr.CapitalizationFilterFactory"
            onlyFirstWord="false"
            keep="and or the is my of for de"
            okPrefix="McK"
            forceFirstLetter="true"/>
    <filter class="solr.UnTokenizerFilterFactory" adjacent="; "/>
  </analyzer>
  <analyzer type="index,query"> <!-- type="index,query" is optional -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;|\s]+"/>
    ...
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
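
To make the intended difference between the stored and indexed forms concrete,
the following plain-Java sketch approximates the two chains above with string
operations. The class and method names are hypothetical, and the
capitalization logic is a deliberately simplified stand-in for the
CapitalizationFilter (it honors the keep list after the first word but ignores
okPrefix); it is not the filter's actual behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: approximates the "store" and "index" chains with plain strings.
final class AuthorFieldSketch {

    // Words kept lower-cased unless they start a value (assumed reading of keep="...").
    private static final Set<String> KEEP =
            new HashSet<>(Arrays.asList("and", "or", "the", "is", "my", "of", "for", "de"));

    // Capitalizes each word of one author name, leaving KEEP words in lower case.
    static String capitalizeWords(String value) {
        StringBuilder out = new StringBuilder();
        String[] words = value.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (words[i].isEmpty()) {
                continue;
            }
            String lower = words[i].toLowerCase(Locale.ROOT);
            String cased = (i > 0 && KEEP.contains(lower))
                    ? lower
                    : Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(cased);
        }
        return out.toString();
    }

    // "store" chain: split on ";\s+", capitalize each author, rejoin with "; ".
    static String storedValue(String raw) {
        List<String> authors = new ArrayList<>();
        for (String author : raw.trim().split(";\\s+")) {
            if (!author.isEmpty()) {
                authors.add(capitalizeWords(author));
            }
        }
        return String.join("; ", authors);
    }

    // "index" chain: split on punctuation and whitespace, then lower-case.
    static List<String> indexedTerms(String raw) {
        List<String> terms = new ArrayList<>();
        for (String term : raw.split("[,;|\\s]+")) {
            if (!term.isEmpty()) {
                terms.add(term.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String raw = "jane de la cruz; JOHN SMITH";
        System.out.println(storedValue(raw));
        System.out.println(indexedTerms(raw));
    }
}
```

The point of the sketch is only that one input value yields a cleaned-up
display string for storage and a separate, normalized term list for search.
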
In a similar example, stored values could be run through the
HyphenatedWordsFilterFactory (and then untokenized) so they reflect what is
actually being indexed.
One could even store the result of analysis (perhaps in a CopyField) as a
visual token mapping to help diagnose indexing/analysis problems, concatenated
with something on the order of
<filter class="solr.UnTokenizerFilterFactory" adjacent=" " overlap=" / " missing="&lt;null&gt;"/>
e.g. "<null> quick / fast dog / canine jumped ..."
Then, to address the other concern (2) of allowing user control of field types,
one solution would be to recast the StoreAnalysisProcessor as, say, a
DynamicFieldTypeProcessor, allowing f.<field>.type=<fieldType> when it is
inserted in the chain, e.g. for language-specific analysis.
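
A processor like that would first need to pick the per-field overrides out of
the request parameters. The sketch below shows one way that parsing could
look, in plain Java; the class and method names are hypothetical, and it
assumes the parameters arrive as a simple String-to-String map rather than
through any particular Solr API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: extracts f.<field>.type=<fieldType> overrides from request parameters.
final class FieldTypeOverrideSketch {

    private static final String PREFIX = "f.";
    private static final String SUFFIX = ".type";

    // Returns a map of field name -> requested fieldType name.
    static Map<String, String> parseOverrides(Map<String, String> params) {
        Map<String, String> overrides = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            String key = e.getKey();
            // Require a non-empty field name between the prefix and the suffix.
            if (key.startsWith(PREFIX) && key.endsWith(SUFFIX)
                    && key.length() > PREFIX.length() + SUFFIX.length()) {
                String field = key.substring(PREFIX.length(), key.length() - SUFFIX.length());
                overrides.put(field, e.getValue());
            }
        }
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("commit", "true");
        params.put("f.title.type", "text_fr");
        params.put("f.feature.type", "text_ws");
        System.out.println(parseOverrides(params));
    }
}
```

Parameters that do not match the f.<field>.type pattern are ignored, so the
processor would fall back to the schema's own field type for those fields.
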
(It's late, I hope this all makes sense...)
> Store Analyzed token text from an incoming SolrInputDocument
> ------------------------------------------------------------
>
> Key: SOLR-314
> URL: https://issues.apache.org/jira/browse/SOLR-314
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Ryan McKinley
> Attachments: SOLR-314-StoreAnalysis.patch
>
>
> This is an UpdateRequestProcessor that runs incoming fields through a Field
> Analyzer and stores the output of each token as a field value.
> For example, if you have a field type defined:
> <fieldType name="text_ws" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> And send a request:
> /update?store.analysis=true&f.feature.analysis=text_ws
> <add>
>   <doc>
>     <field name="feature">aaa bbb ccc</field>
>   </doc>
> </add>
> The returned document will look like:
> <doc>
>   <arr name="feature">
>     <str>aaa</str>
>     <str>bbb</str>
>     <str>ccc</str>
>   </arr>
> </doc>