Hi,
Over the last two weeks I have been trying to come up with an
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can
have German, English, Italian or French names. People have an additional
firstname field. We also want to do auto suggest on the street and city names
as well as on emails and telephone numbers, so we are treating phone numbers
as text.
We do give the user the option to use phonetic searches or to split words
(especially German compound words), but I guess we will leave that out of
the auto suggest.
We do expect that some users will type in properly cased strings, while some
may just type in all lowercase.
We are using the dismax defType for our normal queries.
There will probably be less than 20M entities.
So I guess the best approach is to copy all of the above mentioned fields
(name, firstname, city, street, email, telefon) into a new field called all.
facet.prefix seems to fit our requirements best. We will
therefore split off the last term in the query and pass it in as the
facet.prefix, while the rest is passed in as the q parameter.
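To make the split concrete, here is a minimal Python sketch of what I have in
mind. The function name split_for_suggest is made up for illustration, and
treating trailing whitespace as "no partial term to complete" is my own
assumption, not something Solr requires:

```python
from typing import Tuple


def split_for_suggest(user_input: str) -> Tuple[str, str]:
    """Split typed text into (q, facet_prefix).

    The last whitespace-separated term becomes the facet.prefix value,
    everything before it becomes the q value.
    """
    terms = user_input.split()
    if not terms:
        return "", ""
    # Assumption: trailing whitespace means the user finished the last
    # word, so there is no partial term left to complete.
    if user_input[-1].isspace():
        return " ".join(terms), ""
    return " ".join(terms[:-1]), terms[-1]


q, prefix = split_for_suggest("Kaltenreider Ver")
print(q, prefix)
```

When the user has typed only one term, q comes back empty, so that case would
need separate handling in the request.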
Since facets are computed from the indexed terms, we will use the following type
definition for this all field:
<fieldType name="textplain" class="solr.TextField"
    positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>
So essentially the idea is to just split on whitespace, remove stop words and
split on word delimiters.
The query would then look something like the following if the user entered
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver
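For reference, this is roughly how the request could be assembled on the
client side. It is only a sketch in Python: build_suggest_url is a
hypothetical helper name, and the base URL and parameter values are simply the
ones from the example request above:

```python
from urllib.parse import urlencode


def build_suggest_url(q, prefix,
                      base="http://localhost:8983/solr/core0/select"):
    """Build the facet.prefix based suggest request for the all field."""
    # A list of pairs keeps the parameters in a fixed, readable order.
    params = [
        ("defType", "dismax"),
        ("qf", "all"),
        ("q", q),                  # everything before the last term
        ("indent", "on"),
        ("facet", "on"),
        ("facet.limit", 10),
        ("facet.mincount", 1),
        ("facet.field", "all"),
        ("rows", 0),               # only the facet counts are needed
        ("facet.prefix", prefix),  # the partially typed last term
    ]
    return base + "?" + urlencode(params)


print(build_suggest_url("Kaltenreider", "Ver"))
```

urlencode also takes care of escaping, which matters once the user types
spaces or umlauts.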
Does this approach make sense so far?
Do you expect this to perform decently on a dual quad-core machine with 16 GB of
RAM, even though all of that will be shared with Apache, a MySQL slave and a PHP
app? I know questions like that are impossible to answer in general, so I am just
asking whether you expect this to be really heavy. I noticed that in my initial
testing with 2M entities on my laptop, facets seemed to be fine, though the first
request was slow and memory use spiked to 300 MB. But I presume it is just loading
stuff into the cache, and concurrent requests should not cause memory use to go up
linearly.
I am still a bit unsure how to handle both the lowercased and the
case-preserved version.
Here are some examples:
UBS = ubs|UBS
Kreuzstrasse = kreuzstrasse|Kreuzstrasse
So when I type Kreu I would get a suggestion of Kreuzstrasse and with
kreu I would get kreuzstrasse.
Since I do not expect any words to start with a lowercase letter and still
contain an uppercase letter, we should be fine with this approach.
That is, I doubt there would be terms like fooBar, which would lead to
suggesting both foobar and fooBar.
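To pin down the behaviour I am after, here is a plain Python sketch. It is not
Solr configuration and says nothing about how Solr would do it; the names
index_variants and suggest are made up, and the prefix match just stands in
for what facet.prefix would do on the indexed terms:

```python
def index_variants(token):
    """Terms I would like indexed for one token: original case plus lowercase."""
    return sorted({token, token.lower()})


def suggest(prefix, tokens):
    """Case-sensitive prefix match over all indexed variants."""
    terms = set()
    for token in tokens:
        terms.update(index_variants(token))
    return sorted(term for term in terms if term.startswith(prefix))


tokens = ["UBS", "Kreuzstrasse"]
print(suggest("Kreu", tokens))  # ['Kreuzstrasse']
print(suggest("kreu", tokens))  # ['kreuzstrasse']
```

A token that is already all lowercase yields a single term, so nothing gets
duplicated in that case.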
How can I achieve this?
regards,
Lukas Kahwe Smith
m...@pooteeweet.org