On 17/01/2013 10:21, Nikola Tulechki wrote:
> Hello
>
> I am using Lucy on some technical documentation and I have a bunch of
> acronyms that must not be stemmed.
>
> Is there a way to add a stoplist to the stemmer so it skips some terms?

It can be done, but it's not trivial and probably not very performant.
First, you have to write your own Analyzer class in Perl. See the
following threads for some guidance:
http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%[email protected]%3E
http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%[email protected]%3E
We really need a cookbook entry describing how to write custom
analyzers. But to get started, here is some minimal skeleton code that I
have used in the past:

    package My::Custom::Analyzer;
    use strict;
    use warnings;
    use base qw(Lucy::Analysis::Analyzer);

    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);
        # Setup your analyzer here
        return $self;
    }

    sub transform {
        my ($self, $inversion) = @_;
        # Walk over every token and rewrite its text in place
        while (my $token = $inversion->next) {
            my $text = $token->get_text;
            # Transform $text here
            $token->set_text($text);
        }
        # Rewind so the next analyzer starts at the first token
        $inversion->reset;
        return $inversion;
    }

    # Placeholder: treats any two analyzers as equal
    sub equals {
        return 1;
    }

    1;

For a proper implementation, you should also provide "dump" and "load"
methods and a real "equals" method, but they're not strictly necessary for
a one-off job. Just remember to reindex every time you change the
parameters of your custom analyzer: without "dump", "load" and "equals",
Lucy cannot detect the mismatch, so you won't get an error message if
you forget.

Your custom analyzer should then stem the words that are not in your
stoplist one by one using the (undocumented) "split" method. So the
"Transform $text here" part of your analyzer would look like:

    if (!$stoplist->{$text}) {
        my $tokens = $stemmer->split($text);
        $text = $tokens->[0];
    }

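The same check can be pulled out into a small helper, which makes it easy to try out without an index. This is only a sketch: "stem_unless_listed" and the example acronyms are made-up names, and the only thing assumed about $stemmer is that it has a "split" method returning an array reference, as in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text to store back into the token.
# $stoplist is a hash reference whose keys are the protected terms.
# $stemmer is assumed to be any object with a split($text) method
# that returns an array reference.
sub stem_unless_listed {
    my ($text, $stoplist, $stemmer) = @_;

    # Protected terms are returned untouched
    return $text if exists $stoplist->{$text};

    my $tokens = $stemmer->split($text);

    # Fall back to the original text if the stemmer returned nothing
    return ($tokens && @$tokens) ? $tokens->[0] : $text;
}

# Build the lookup hash once (for example in the constructor),
# not once per token. The terms here are examples only.
my %stoplist = map { $_ => 1 } qw(XML HTTP NASA);
```

Inside "transform" this would be called as $text = stem_unless_listed($text, \%stoplist, $stemmer). Note that the hash lookup is case sensitive, so if an earlier analyzer in your chain lowercases the tokens, the keys of the stoplist have to be lowercase as well.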
Also note that you have to store member variables like "stoplist" and
"stemmer" of your analyzer class using the "inside-out" approach (one
global hash per variable). You'll find some example code showing how to
do that in the threads I mentioned above.
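
In case those threads are hard to reach, here is a rough sketch of the inside-out pattern in plain Perl. To keep it runnable without Lucy, it uses a made-up base class "My::Base" in place of Lucy::Analysis::Analyzer; the "stoplist" and "stemmer" constructor arguments and the accessor names are my own invention too. The example code in the threads may differ in detail, for instance in what it uses as the hash key.

```perl
use strict;
use warnings;

package My::Base;    # hypothetical stand-in for Lucy::Analysis::Analyzer

sub new {
    my ($class, %args) = @_;
    my $slot;
    # An opaque blessed scalar: there is no hash to put members into
    return bless \$slot, $class;
}

package My::Custom::Analyzer;
use Scalar::Util qw(refaddr);
our @ISA = ('My::Base');

# One file-scoped hash per member variable, keyed by object address
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;
    # Take out our own arguments before calling the parent constructor
    my $stoplist = delete $args{stoplist} || {};
    my $stemmer  = delete $args{stemmer};
    my $self     = $class->SUPER::new(%args);
    $stoplist{ refaddr($self) } = $stoplist;
    $stemmer{ refaddr($self) }  = $stemmer;
    return $self;
}

# Accessors look the values up by the address of the object
sub stoplist { return $stoplist{ refaddr($_[0]) } }
sub stemmer  { return $stemmer{ refaddr($_[0]) } }

sub DESTROY {
    my ($self) = @_;
    # Remove the entries, otherwise the hashes grow forever
    delete $stoplist{ refaddr($self) };
    delete $stemmer{ refaddr($self) };
    # Assumption: with a real Lucy base class you would also
    # call $self->SUPER::DESTROY here.
}
```

With this in place, "transform" can fetch the members with $self->stoplist and $self->stemmer instead of treating $self as a hash reference.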
Nick