Re: Solr stemming -> preserve original words

AHMET ARSLAN Sat, 24 Jan 2009 10:13:56 -0800

I still don't understand your final goal, but if you want output in the form 
of 
"run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner" 
you need to index your documents using the standard analyzer, which does not 
stem. Then walk through the index using org.apache.lucene.index.IndexReader 
and stem each term with a stemmer. Storing each stem (key) and its list of 
original words (value) in a map will give that kind of output.
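
A minimal sketch of that map-building step, in plain Java with no Lucene 
dependency. Everything named here is an assumption for illustration: the 
StemGroups class, its methods, and toyStem, which is only a stand-in for a 
real Porter stemmer. In practice the term/frequency pairs would come from 
walking the index with IndexReader; here they are passed in as a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not part of Solr or Lucene.
class StemGroups {

    // Groups original terms under their stem: stem -> (original term -> frequency).
    // termFreqs stands in for the (term, frequency) pairs read from the index.
    static Map<String, Map<String, Integer>> group(Map<String, Integer> termFreqs,
                                                   Function<String, String> stemmer) {
        Map<String, Map<String, Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            String stem = stemmer.apply(e.getKey());
            groups.computeIfAbsent(stem, k -> new LinkedHashMap<>())
                  .merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return groups;
    }

    // Formats one group as "stem(total) => N from word, ...", most frequent word first.
    static String format(String stem, Map<String, Integer> originals) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(originals.entrySet());
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        int total = 0;
        StringBuilder parts = new StringBuilder();
        for (Map.Entry<String, Integer> e : sorted) {
            total += e.getValue();
            if (parts.length() > 0) {
                parts.append(", ");
            }
            parts.append(e.getValue()).append(" from ").append(e.getKey());
        }
        return stem + "(" + total + ") => " + parts;
    }

    // Toy stemmer (assumption): strips a few suffixes so the example is
    // self-contained. Replace it with a real stemmer in actual use.
    static String toyStem(String term) {
        for (String suffix : new String[] {"ners", "ning", "ner"}) {
            if (term.endsWith(suffix)) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }
}
```

Calling StemGroups.group(termFreqs, StemGroups::toyStem) and then 
StemGroups.format(stem, originals) for each entry yields one line per stem 
in the form shown above.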


However, if seeing something like the following list (not exactly what you 
want, but similar) on schema.jsp would help you

run=>run
run=>running
run=>runner
run=>runners

add one line of code 

newstr = newstr + "=>" +  new String(termBuffer, 0, len);

to org.apache.solr.analysis.EnglishPorterFilterFactory.java between lines #116 
and #117.

Rename the file (and the class inside it), compile the code, and put your jar 
file in the lib directory under your Solr home. Now you can use your new 
FilterFactory in your schema.xml.


--- On Sat, 1/24/09, Thushara Wijeratna <thu...@gmail.com> wrote:

> From: Thushara Wijeratna <thu...@gmail.com>
> Subject: Re: Solr stemming -> preserve original words
> To: solr-user@lucene.apache.org, iori...@yahoo.com
> Date: Saturday, January 24, 2009, 1:53 AM
> Chris, Ahmet - thanks for the responses.
> 
> Ahmet - yes, i want to see "run" as a top term +
> the original words that
> formed that term
> The reason is that due to mis-stemming, the terms could
> become non-english.
> ex:  "permanent" would stem to "perm",
> "archive" would become "archiv".
> 
> I need to extract a set of keywords from the indexed
> content - I'd like
> these to be correct full english words.
> 
> thanks,
> thushara
