Re: multi term, multi field, auto suggest

2010-02-01 Thread Lukas Kahwe Smith

On 29.01.2010, at 15:40, Lukas Kahwe Smith wrote:

> I am still a bit unsure how to handle both the lowercased and the case 
> preserved version:
> 
> So here are some examples:
> UBS = ubs|UBS
> Kreuzstrasse = kreuzstrasse|Kreuzstrasse
> 
> So when I type Kreu I would get a suggestion of Kreuzstrasse and with 
> kreu I would get kreuzstrasse.
> Since I do not expect any words to start with a lowercase letter and still 
> contain some upper case letter we should be fine with this approach.
> 
> As in I doubt there would be stuff like fooBar which would lead to 
> suggesting both foobar and fooBar.
> 
> How can I achieve this?


I just noticed that I need the same thing for the word delimiter splitter: 
some way to index both the split and the unsplit versions so that I can use 
them in a facet search.

Hans-Peter = Hans|Peter|Hans-Peter

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: multi term, multi field, auto suggest

2010-02-01 Thread Lukas Kahwe Smith

On 01.02.2010, at 13:27, Lukas Kahwe Smith wrote:

 
> On 29.01.2010, at 15:40, Lukas Kahwe Smith wrote:
> 
>> I am still a bit unsure how to handle both the lowercased and the case 
>> preserved version:
>> 
>> So here are some examples:
>> UBS = ubs|UBS
>> Kreuzstrasse = kreuzstrasse|Kreuzstrasse
>> 
>> So when I type Kreu I would get a suggestion of Kreuzstrasse and with 
>> kreu I would get kreuzstrasse.
>> Since I do not expect any words to start with a lowercase letter and still 
>> contain some upper case letter we should be fine with this approach.
>> 
>> As in I doubt there would be stuff like fooBar which would lead to 
>> suggesting both foobar and fooBar.
>> 
>> How can I achieve this?
> 
> 
> I just noticed that I need the same thing for the word delimiter splitter: 
> some way to index both the split and the unsplit versions so that I can use 
> them in a facet search.
> 
> Hans-Peter = Hans|Peter|Hans-Peter


Sorry for the monologue.
I did see 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg29786.html, which 
suggests a solution for lowercase indexing with mixed-case suggestions: 
concatenate the lowercased version and the original version, with some 
separator in between.

I guess I could feed in the same data multiple times and apply the 
[indexterm]|[original] approach in userland somehow.

For example, Hans-Peter would be turned into three documents:
hans|Hans-Peter
peter|Hans-Peter
hans-peter|Hans-Peter

This solution would be quite cool indeed, since I could suggest Hans-Peter if 
someone searches for Peter.
Since I will only use this for a prefix search, I could set the query analyzer 
to lowercase the search term, which should then match the lowercased index 
terms. I can then add some logic to the frontend display code to split off the 
original term from the suggestion.
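
To make the [indexterm]|[original] idea concrete, here is a minimal userland 
sketch in Python, outside of Solr. The function names, the "|" separator and 
the rule of splitting on any non-alphanumeric character are assumptions made 
for illustration; they only mimic what the word delimiter filter would do.

```python
import re

SEPARATOR = "|"  # assumed separator; it must not occur in the indexed data


def suggest_entries(original):
    """Build one 'indexterm|original' entry per word part, plus the whole term."""
    # Split on runs of non-alphanumeric characters (an assumed stand-in for
    # the word delimiter splitting done inside Solr).
    parts = [p for p in re.split(r"[\W_]+", original) if p]
    terms = [p.lower() for p in parts]
    whole = original.lower()
    if whole not in terms:
        terms.append(whole)
    return [f"{term}{SEPARATOR}{original}" for term in terms]


def display_term(entry):
    """Split off the original, case-preserved term for display."""
    return entry.split(SEPARATOR, 1)[1]


def suggest(entries, typed):
    """Prefix-match the lowercased input and return the distinct originals."""
    prefix = typed.lower()
    found = []
    for entry in entries:
        if entry.startswith(prefix):
            original = display_term(entry)
            if original not in found:
                found.append(original)
    return found


entries = suggest_entries("Hans-Peter")
print(entries)
print(suggest(entries, "Pet"))
```

With this, typing Pet or pet would both lead to the suggestion Hans-Peter, 
because only the lowercased part in front of the separator is matched.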

I am not aware of any magic inside the schema.xml that could do this work for 
me though. I am using the DataImportHandler to load the documents. I guess I 
could simply run the import query multiple times, but that would interfere 
with the indexing of the fields that are not used for auto suggest. Then again, 
maybe I want to separate the two indexes completely anyway.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





multi term, multi field, auto suggest

2010-01-29 Thread Lukas Kahwe Smith

Hi,

Over the course of the last two weeks I have been trying to come up with an 
optimal solution for auto suggest in the project I am currently working on.
In the application we have names of people and companies. The companies can 
have German, English, Italian or French names. People have an additional 
firstname field. We also want to do auto suggest on the street and city names 
as well as on emails and telephone numbers, so we are treating phone numbers 
as text.

We do give the user the option to use phonetic searches or to split words 
(especially the compound German words), but I guess we will leave that out of 
the auto suggest.
We do expect that some users will type in properly cased strings, while some 
may just type in all lowercase.
We are using the dismax defType for our normal queries.

There will probably be less than 20M entities.

As such I guess the best approach is to copy all of the above mentioned fields 
(name, firstname, city, street, email, telefon) into a new field called all.
It seems the best approach is to use facet.prefix for our requirements. We will 
therefore split off the last term in the query and pass it in as the 
facet.prefix, while the rest is passed in as the q parameter.

Since facets are driven by the indexed terms, we will use the following type 
definition for this all field:
<fieldType name="textplain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>

So essentially the idea is to just split on whitespace, remove stop words and 
split on word delimiters.

The query would then look something like the following if the user would enter 
Kaltenreider Ver:
http://localhost:8983/solr/core0/select?defType=dismax&qf=all&q=Kaltenreider&indent=on&facet=on&facet.limit=10&facet.mincount=1&facet.field=all&rows=0&facet.prefix=Ver
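
As a rough sketch of how the frontend could build such a request, the 
following Python splits off the last term and encodes the parameters. The 
function name and its defaults are made up for illustration, and leaving out 
q when the user has typed only one word is an assumption, not something I 
have tested against dismax.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/core0/select"


def build_suggest_url(user_input, field="all", limit=10):
    """Use the last typed term as facet.prefix and the rest as q."""
    terms = user_input.split()
    if not terms:
        raise ValueError("nothing typed yet, so there is nothing to suggest")
    *rest, prefix = terms

    params = [("defType", "dismax"), ("qf", field)]
    if rest:
        # Assumption: q is only sent when there are terms before the prefix.
        params.append(("q", " ".join(rest)))
    params += [
        ("indent", "on"),
        ("facet", "on"),
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
        ("facet.field", field),
        ("rows", "0"),
        ("facet.prefix", prefix),
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_suggest_url("Kaltenreider Ver"))
```

Note that facet.prefix is matched against the indexed terms as they are, so 
with the field type above the prefix is case sensitive.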

Does this approach make sense so far?
Do you expect this to perform decently on a dual quad core machine with 16 GB 
of RAM, even though all of that will be shared with Apache, a MySQL slave and 
a PHP app? I know questions like that are impossible to answer precisely, so I 
am really just asking whether you expect this to be heavy. In my initial 
testing with 2M on my laptop, facets seemed to be fine, though the first 
request was slow and the memory use spiked to 300 MB. But I presume it is just 
loading stuff into the cache, and concurrent requests shouldn't cause the 
memory use to go up linearly.

I am still a bit unsure how to handle both the lowercased and the case 
preserved version:

So here are some examples:
UBS = ubs|UBS
Kreuzstrasse = kreuzstrasse|Kreuzstrasse

So when I type Kreu I would get a suggestion of Kreuzstrasse and with 
kreu I would get kreuzstrasse.
Since I do not expect any words to start with a lowercase letter and still 
contain some upper case letter we should be fine with this approach.

As in I doubt there would be stuff like fooBar, which would lead to suggesting 
both foobar and fooBar.
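
To illustrate the behaviour I am after, here is a small Python sketch outside 
of Solr. It is not a Solr configuration; the function names are made up and 
it simply assumes that every token is indexed in both its lowercased and its 
original form.

```python
def index_variants(token):
    """Return the lowercased and the case-preserved form, without duplicates."""
    return sorted({token.lower(), token})


def prefix_suggestions(tokens, prefix):
    """Case-sensitive prefix match over all indexed variants."""
    indexed = set()
    for token in tokens:
        indexed.update(index_variants(token))
    return sorted(term for term in indexed if term.startswith(prefix))


tokens = ["UBS", "Kreuzstrasse", "fooBar"]
print(prefix_suggestions(tokens, "Kreu"))  # ['Kreuzstrasse']
print(prefix_suggestions(tokens, "kreu"))  # ['kreuzstrasse']
print(prefix_suggestions(tokens, "foo"))   # ['fooBar', 'foobar']
```

The last call shows the fooBar case: a word that starts lowercase but 
contains an upper case letter yields two suggestions for the same prefix.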

How can I achieve this?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org