Re: multi term, multi field, auto suggest

2010-02-01 Thread Lukas Kahwe Smith


On 29.01.2010, at 15:40, Lukas Kahwe Smith wrote:

> I am still a bit unsure how to handle both the lowercased and the case
> preserved version:
>
> So here are some examples:
> UBS => ubs|UBS
> Kreuzstrasse => kreuzstrasse|Kreuzstrasse
>
> So when I type Kreu I would get a suggestion of Kreuzstrasse and with
> kreu I would get kreuzstrasse.
> Since I do not expect any words to start with a lowercase letter and still
> contain some upper case letter we should be fine with this approach.
>
> As in I doubt there would be stuff like fooBar which would lead to
> suggesting both foobar and fooBar.
>
> How can I achieve this?


I just noticed that I need the same thing for the word delimiter splitting:
some way to index both the split and the unsplit version so that I can use
them in a facet search.

Hans-Peter => Hans|Peter|Hans-Peter

regards,
Lukas Kahwe Smith
m...@pooteeweet.org

Re: multi term, multi field, auto suggest

2010-02-01 Thread Lukas Kahwe Smith


On 01.02.2010, at 13:27, Lukas Kahwe Smith wrote:

 
> On 29.01.2010, at 15:40, Lukas Kahwe Smith wrote:
>
>> I am still a bit unsure how to handle both the lowercased and the case
>> preserved version:
>>
>> So here are some examples:
>> UBS => ubs|UBS
>> Kreuzstrasse => kreuzstrasse|Kreuzstrasse
>>
>> So when I type Kreu I would get a suggestion of Kreuzstrasse and with
>> kreu I would get kreuzstrasse.
>> Since I do not expect any words to start with a lowercase letter and still
>> contain some upper case letter we should be fine with this approach.
>>
>> As in I doubt there would be stuff like fooBar which would lead to
>> suggesting both foobar and fooBar.
>>
>> How can I achieve this?
>
> I just noticed that I need the same thing for the word delimiter splitting:
> some way to index both the split and the unsplit version so that I can use
> them in a facet search.
>
> Hans-Peter => Hans|Peter|Hans-Peter


Sorry for the monologue.
I did see
http://www.mail-archive.com/solr-user@lucene.apache.org/msg29786.html, which
suggests a solution for lowercase indexing with mixed case suggestions: the
lowercased version is concatenated with the original version, using some
separator in between.

I guess I could just feed in the same data multiple times and build the
[indexterm]|[original] values in user land somehow.

For example, Hans-Peter would be turned into 3 documents:
hans|Hans-Peter
peter|Hans-Peter
hans-peter|Hans-Peter

This solution would be quite cool indeed, since I could suggest Hans-Peter if 
someone searches for Peter.
Since I will just use this for a prefix search, I could just set the query 
analyzer to lowercase the search and it should find the results and I can then 
add some magic to the frontend display logic to split off the suggested 
original term.
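
A minimal sketch of that user land step, in Python. The function names, the
choice of "|" as separator and the rule of splitting on any non-word character
are assumptions made for illustration; none of this is something Solr provides,
and it assumes the original terms never contain the separator themselves.

```python
import re

SEPARATOR = "|"  # assumed separator; must not occur in the original terms


def suggest_entries(original):
    """Build the [indexterm]|[original] values for one original term."""
    # Split on runs of non-word characters (hyphens, whitespace, ...).
    parts = [p for p in re.split(r"[\W_]+", original) if p]
    keys = []
    # Index every part plus the unsplit term, all lowercased, without repeats.
    for key in parts + [original]:
        key = key.lower()
        if key not in keys:
            keys.append(key)
    return [key + SEPARATOR + original for key in keys]


def display_term(entry):
    """Frontend side: split off the original term from a suggested entry."""
    return entry.split(SEPARATOR, 1)[1]


if __name__ == "__main__":
    for entry in suggest_entries("Hans-Peter"):
        print(entry, "->", display_term(entry))
```

With the lowercased prefix in front, a lowercased user input such as "pet"
matches peter|Hans-Peter, and the frontend shows Hans-Peter.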

I am not aware of any magic inside the schema.xml that could do this work for
me though. I am using the DataImportHandler to load the documents. I guess I
could simply run the query multiple times, but that would interfere with the
indexing of the non auto suggest index. Then again, maybe I want to separate
the two entirely anyway.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org

multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith

Hi,

Over the course of the last two weeks I have been trying to come up with an
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can
have German, English, Italian or French names. People have an additional
firstname field. We also want to do auto suggest on the street and city names
as well as on emails and telephone numbers. As such we are treating phone
numbers as text.

We do have the option for the user to use phonetic searches or to split words
(especially the compound German words), but I guess we will leave that out of
the auto suggest.
We do expect that some users will type in properly cased strings, while some
may just type in all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above mentioned fields
(name, firstname, city, street, email, telefon) into a new field called all.
It seems the best approach is to use facet.prefix for our requirements. We will
therefore split off the last term in the query and pass it in as the
facet.prefix while the rest is passed in as the q parameter.

Since facets are driven out of the index, we will use the following type
definition for this all field:
<fieldType name="textplain" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

So essentially the idea is to just split on whitespace, remove stop words and
split on word delimiters.

The query would then look something like the following if the user entered
"Kaltenreider Ver":
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver
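
To make the splitting concrete, here is a rough Python sketch of how the client
could build that request. The function name build_suggest_url is made up for
illustration, and the host, core name and parameter values are simply the ones
from the example above. It does not handle the case where the user has typed
only one term, which would leave q empty.

```python
from urllib.parse import urlencode

# Host and core taken from the example above; adjust for a real setup.
BASE_URL = "http://localhost:8983/solr/core0/select"


def build_suggest_url(user_input, base=BASE_URL):
    """Split off the last term as facet.prefix; the rest becomes q."""
    terms = user_input.split()
    if not terms:
        raise ValueError("empty input")
    *head, prefix = terms
    params = [
        ("defType", "dismax"),
        ("qf", "all"),
        ("q", " ".join(head)),      # everything before the last term
        ("indent", "on"),
        ("facet", "on"),
        ("facet.limit", "10"),
        ("facet.mincount", "1"),
        ("facet.field", "all"),
        ("rows", "0"),              # only the facet counts are needed
        ("facet.prefix", prefix),   # the term still being typed
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_suggest_url("Kaltenreider Ver"))
```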

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad core machine with 16 GB
of RAM, albeit all of that will be shared with Apache, a MySQL slave and a PHP
app? Questions like that are impossible to answer, so I am just asking whether
you expect this to be really heavy. I noticed that in my initial testing with
2M on my laptop facets seemed to be fine, though the first request was slow and
the memory use spiked to 300MB. But I presume it's just loading stuff into
cache, and concurrent requests shouldn't cause the memory use to go up
linearly.

I am still a bit unsure how to handle both the lowercased and the case
preserved version:

So here are some examples:
UBS => ubs|UBS
Kreuzstrasse => kreuzstrasse|Kreuzstrasse

So when I type Kreu I would get a suggestion of Kreuzstrasse and with
kreu I would get kreuzstrasse.
Since I do not expect any words to start with a lowercase letter and still
contain some upper case letter we should be fine with this approach.

As in I doubt there would be stuff like fooBar which would lead to suggesting
both foobar and fooBar.

How can I achieve this?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
