It's an interesting question.

To start with, copyField copies the raw source content, so none of the
source field's analysis applies; only the target field's analyzer runs.
So that approach is not suitable.
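For illustration, this is roughly what it looks like in schema.xml
(field and type names are made up): copyField names only a source and a
dest, and whatever analyzer is attached to the dest field's type is the
one that runs at index time.

```
<!-- illustrative schema.xml fragment: copyField has no analysis of its
     own; the "suggest" field's type decides how the copied text is
     tokenized, regardless of how "title" itself is analyzed -->
<field name="title"   type="string"       indexed="true" stored="true"/>
<field name="suggest" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="suggest"/>
```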

Regarding lookups/auto-complete: a bunch of implementations have been
added recently, but they are not really documented. Things like
BlendedInfixSuggester are a bit hard to discover at the moment, so
there might be something useful there if one digs a bit.
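From memory, a BlendedInfixSuggester is wired up through the
SuggestComponent in solrconfig.xml, roughly like this. Treat it as a
sketch: the field choices are illustrative and the exact parameter set
may vary between versions.

```
<!-- illustrative solrconfig.xml fragment: a suggester backed by the
     blended-infix lookup, drawing its dictionary from a stored field -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">blended</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```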

The other option is to do the tokenization in the
UpdateRequestProcessor chain. You could clone a field and do some
processing so that, by the time the content hits Solr's analysis phase,
it is already pre-tokenized into a multi-valued field. Then you could
use KeywordTokenizer on your collector field, with separate URP
sub-chains for each original field that feeds into it. One related hack
would be to create a subclass of FieldMutatingUpdateProcessorFactory
that wraps an arbitrary tokenizer and emits the tokens as multi-valued
output.
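Stripped of the actual Solr/Lucene plumbing, the pre-tokenizing idea
boils down to something like the sketch below: keyword-ish sources
(titles, authors) go into the collector field whole, while full-text
sources get split first. It is plain Java with a whitespace split
standing in for a real Tokenizer, and the class and method names are
made up, not Solr APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class PreTokenizeSketch {

    // Stand-in "tokenizer": a real URP would run a Lucene Tokenizer here.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Build the multi-valued collector field: keyword values are kept as
    // single tokens, full-text values are split into individual tokens.
    static List<String> buildSuggestValues(List<String> keywordValues,
                                           List<String> textValues) {
        List<String> out = new ArrayList<>(keywordValues);
        for (String text : textValues) {
            out.addAll(tokenize(text));
        }
        return out;
    }
}
```

With KeywordTokenizer on the collector field, each of these values then
becomes exactly one indexed token, which is what the suggester would
see.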

This is a bit hazy, even in my own mind, but hopefully it gives you
something new to think about.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 11, 2014 at 8:05 AM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
> The lack of response to this question makes me think that either there is no
> good answer, or maybe the question was too obtuse.  So I'll give it one more
> go with some more detail ...
>
> My main goal is to implement autocompletion with a mix of words and short
> phrases, where the words are drawn from the text of largish documents, and
> the phrases are author names and document titles.
>
> I think the best way to accomplish this is to concoct a single field that
> contains data from these other "source" fields (as usual with copyField),
> but with some of the fields treated as keywords (ie with their values
> inserted as single tokens), and others tokenized.  I believe this would be
> possible at the Lucene level by calling Document.addField () with multiple
> fields having the same name: some marked as TOKENIZED and others not.  I
> think the tokenized fields would have to share the same analyzer, but that's
> OK for my case.
>
> I can't see how this could be made to happen in Solr without a lot of custom
> coding though. It seems as if the conversion from Solr fields to Lucene
> fields is not an easy thing to influence.  If anyone has an idea how to
> achieve the subgoal, or perhaps a different way of getting at the main goal,
> I'd love to hear about it.
>
> So far my only other idea is to write some kind of custom analyzer that
> treats short texts as keywords and tokenizes longer ones, which is probably
> what I'll look at if nothing else comes up.
>
> Thanks
>
> Mike
>
>
>
> On 4/9/2014 4:16 PM, Michael Sokolov wrote:
>>
>> I think I would like to do something like copyfield from a bunch of fields
>> into a single field, but with different analysis for each source, and I'm
>> pretty sure that's not a thing. Is there some alternate way to accomplish my
>> goal?
>>
>> Which is to have a suggester that suggests words from my full text field
>> and complete phrases drawn from my author and title fields all at the same
>> time.  So If I could index author and title using KeyWordAnalyzer, and full
>> text tokenized, that would be the bees knees.
>>
>> -Mike
>
>
