Hi Roxana,
The idea with the update request processor is to have the following parameters:
* inputField - the document field with the text to analyse
* sharedAnalysis - the field type with the shared analysis definition
* targetFields - comma separated list of fields where the results should be stored
* fieldSpecificAnalysis - comma separated list of field types that define the
specifics for each field (if you reuse schema field types, they will have an
extra tokenizer that should be ignored)

Your update processor uses TeeSinkTokenFilter to create tokens for each field,
but you do not write those tokens to the index. Instead, you add new fields to
the document, where each token becomes a new value (or you can concatenate the
tokens and use a whitespace tokenizer in the indexing analysis chain of the
target field). You can then remove inputField from the document.
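
A minimal sketch of that processor, assuming the shared analysis sets token
type attributes such as "VERB" and "ADJECTIVE" (class name, target field names
and token types are illustrative, not part of any existing Solr API):

import java.io.IOException;
import java.util.Collections;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.sinks.TeeSinkTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class SharedAnalysisProcessor extends UpdateRequestProcessor {
  private final String inputField;
  private final Analyzer sharedAnalyzer;

  public SharedAnalysisProcessor(String inputField, Analyzer sharedAnalyzer,
                                 UpdateRequestProcessor next) {
    super(next);
    this.inputField = inputField;
    this.sharedAnalyzer = sharedAnalyzer;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object text = doc.getFieldValue(inputField);
    if (text != null) {
      // Run the shared analysis once.
      try (TokenStream source = sharedAnalyzer.tokenStream(inputField, text.toString())) {
        TeeSinkTokenFilter tee = new TeeSinkTokenFilter(source);
        // One sink per target field; TypeTokenFilter in whitelist mode keeps
        // only tokens of the given type.
        TokenStream verbs = new TypeTokenFilter(
            tee.newSinkTokenStream(), Collections.singleton("VERB"), true);
        TokenStream adjectives = new TypeTokenFilter(
            tee.newSinkTokenStream(), Collections.singleton("ADJECTIVE"), true);

        // Consume the tee once; the sinks replay the cached tokens.
        tee.reset();
        while (tee.incrementToken()) { }
        tee.end();

        addTokens(doc, "verbs", verbs);
        addTokens(doc, "adjectives", adjectives);
      }
      doc.removeField(inputField); // drop the analysed source text
    }
    super.processAdd(cmd);
  }

  private void addTokens(SolrInputDocument doc, String field, TokenStream ts)
      throws IOException {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      doc.addField(field, term.toString()); // each token is a new value
    }
    ts.end();
  }
}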

HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 17:46, Roxana Danger <roxana.dan...@gmail.com> wrote:
> 
> Hi Emir,
> In this case, I need more control at the Lucene level, so I have to use the
> Lucene IndexWriter directly, and therefore I cannot use Solr for importing.
> Or is there any way I can add a token stream to a SolrInputDocument (is
> there any other class exposed by Solr during indexing that I can use for
> this purpose)?
> Am I correct, or am I still missing something?
> Thank you.
> 
> 
> On Wed, Nov 22, 2017 at 11:33 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Roxana,
>> I think you can use
>> https://lucene.apache.org/core/5_4_0/analyzers-common/org/apache/lucene/analysis/sinks/TeeSinkTokenFilter.html
>> as suggested earlier.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 22 Nov 2017, at 11:43, Roxana Danger <roxana.dan...@gmail.com> wrote:
>>> 
>>> Hi Emir,
>>> Many thanks for your reply.
>>> The UpdateProcessor can do this work, but is analyzer.reusableTokenStream
>>> <https://lucene.apache.org/core/3_0_3/api/core/org/apache/lucene/analysis/Analyzer.html#reusableTokenStream(java.lang.String, java.io.Reader)>
>>> the way to obtain a previously generated token stream? Is it guaranteed
>>> to give access to the existing token stream rather than reconstruct it?
>>> Thanks,
>>> Roxana
>>> 
>>> 
>>> On Wed, Nov 22, 2017 at 10:26 AM, Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
>>>> Hi Roxana,
>>>> I don’t think that it is possible. In some cases (and yours seems like a
>>>> good fit) you could create a custom update request processor that does the
>>>> shared analysis (you can have it defined in the schema) and, after the
>>>> analysis, uses those tokens to create new values for those two fields and
>>>> removes the source value (or flags it as ignored in the schema).
>>>> 
>>>> HTH,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 22 Nov 2017, at 11:09, Roxana Danger <roxana.dan...@gmail.com> wrote:
>>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I would like to reuse the token stream generated for one field to create
>>>>> a new token stream (adding a few filters to the available stream) for
>>>>> another field, without executing the whole analysis again.
>>>>> 
>>>>> The particular application is:
>>>>> - I have a field *tokens* that uses an analyzer that generates the tokens
>>>>> (and maintains the token type attributes)
>>>>> - I would like to have two new fields: *verbs* and *adjectives*. These
>>>>> should reuse the token stream generated for the field *tokens* and filter
>>>>> the verbs and adjectives into the respective fields.
>>>>> 
>>>>> Is this feasible? How should it be implemented?
>>>>> 
>>>>> Many thanks.
>>>> 
>>>> 
>> 
>> 
