Yes, I see - I could essentially do the tokenization "myself" (or via
some Analyzer chain) in an UpdateRequestProcessor. I think that could
work. Thanks, Alex!
-Mike
On 4/10/14 10:09 PM, Alexandre Rafalovitch wrote:
It's an interesting question.
To start with: copyField copies the raw source content, so no
source-side analysis is ever applied - only the target field's
analysis runs. So that approach is not suitable.
Regarding the lookups/auto-complete: a bunch of new suggester
implementations have been added recently, but they are not really
documented. Things like BlendedInfixSuggester are a bit hard to
discover at the moment, so there might be something useful there if
one digs a lot.
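For reference, a suggester along those lines is registered in solrconfig.xml via the SuggestComponent. This is only a minimal sketch; the field name (title) and analyzer field type (text_general) are placeholders for whatever the actual schema defines:

```xml
<!-- Sketch of a BlendedInfixSuggester setup in solrconfig.xml.
     Field and type names are example placeholders. -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">blended</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<!-- Handler that exposes the suggester -->
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">blended</str>
  </lst>
  <arr name="components"><str>suggest</str></arr>
</requestHandler>
```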
The other option is to do the tokenization in the
UpdateRequestProcessor chain. You could clone a field and do some
processing so that, by the time the content hits Solr, it is already
pre-tokenized into a multi-valued field. Then you could have
KeywordTokenizer on your collector field and separate URP sub-chains
for each original field that feeds into it. One related hack would be
to create a subclass of FieldMutatingUpdateProcessorFactory that wraps
an arbitrary tokenizer and emits the tokens as multi-valued output.
This is a bit hazy, even in my own mind, but hopefully gives you
something new to think about.
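A rough sketch of what that chain might look like in solrconfig.xml. CloneFieldUpdateProcessorFactory is a stock Solr factory; the tokenizing processor is hypothetical - com.example.TokenizingUpdateProcessorFactory stands in for the FieldMutatingUpdateProcessorFactory subclass described above, and the field names are examples:

```xml
<!-- Hazy sketch: collect author/title as whole values, plus a
     hypothetical custom URP that pre-tokenizes the body text. -->
<updateRequestProcessorChain name="suggest-collect">
  <!-- Stock factory: copy author and title verbatim into suggest_all -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <arr name="source">
      <str>author</str>
      <str>title</str>
    </arr>
    <str name="dest">suggest_all</str>
  </processor>
  <!-- Hypothetical custom factory (Solr ships no tokenizing URP):
       splits the text field into tokens appended as multiple values -->
  <processor class="com.example.TokenizingUpdateProcessorFactory">
    <str name="source">text</str>
    <str name="dest">suggest_all</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```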
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
On Fri, Apr 11, 2014 at 8:05 AM, Michael Sokolov
<msoko...@safaribooksonline.com> wrote:
The lack of response to this question makes me think that either there is no
good answer, or the question was too obscure. So I'll give it one more
go with some more detail ...
My main goal is to implement autocompletion with a mix of words and short
phrases, where the words are drawn from the text of largish documents, and
the phrases are author names and document titles.
I think the best way to accomplish this is to concoct a single field that
contains data from these other "source" fields (as usual with copyField),
but with some of the fields treated as keywords (i.e. with their values
inserted as single tokens), and others tokenized. I believe this would be
possible at the Lucene level by calling Document.add() with multiple Field
instances sharing the same name: some marked as tokenized and others not. I
think the tokenized fields would have to share the same analyzer, but that's
OK for my case.
I can't see how this could be made to happen in Solr without a lot of custom
coding though. It seems as if the conversion from Solr fields to Lucene
fields is not an easy thing to influence. If anyone has an idea how to
achieve the subgoal, or perhaps a different way of getting at the main goal,
I'd love to hear about it.
So far my only other idea is to write some kind of custom analyzer that
treats short texts as keywords and tokenizes longer ones, which is probably
what I'll look at if nothing else comes up.
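For what it's worth, the schema side of such a collector field might look like the sketch below. The names are illustrative; KeywordTokenizer keeps each indexed value (a whole author/title phrase, or a single pre-split word) as one token, so all the splitting has to happen before the content reaches the field:

```xml
<!-- Sketch of a multi-valued collector field whose values are
     indexed as single tokens. Names are example placeholders. -->
<fieldType name="suggest_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="suggest_all" type="suggest_keyword"
       indexed="true" stored="true" multiValued="true"/>
```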
Thanks
Mike
On 4/9/2014 4:16 PM, Michael Sokolov wrote:
I think I would like to do something like copyfield from a bunch of fields
into a single field, but with different analysis for each source, and I'm
pretty sure that's not a thing. Is there some alternate way to accomplish my
goal?
Which is to have a suggester that suggests words from my full-text field
and complete phrases drawn from my author and title fields, all at the same
time. So if I could index author and title using KeywordAnalyzer, and full
text tokenized, that would be the bee's knees.
-Mike