Re: [lucy-user] Snowball stemmer stoplists

Nick Wellnhofer Thu, 17 Jan 2013 05:49:12 -0800

On 17/01/2013 10:21, Nikola Tulechki wrote:

> Hello
>
> I am using lucy on a technical documentation and I have a bunch of
> acronyms that must not be stemmed.
> Is there a way to add a stoplist to the stemmer so it skips some terms?

It can be done, but it's not trivial and probably not very performant.
First, you have to write your own Analyzer class in Perl. See the
following threads for some guidance:


http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%[email protected]%3E
http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%[email protected]%3E

We really need a cookbook entry describing how to write custom
analyzers. But to get started, here is some minimal skeleton code that I
have used in the past:


    package My::Custom::Analyzer;
    use strict;

    use base qw(Lucy::Analysis::Analyzer);

    sub new {
        my ($class, %args) = @_;
        my $self = $class->SUPER::new(%args);

        # Setup your analyzer here

        return $self;
    }

    sub transform {
        my ($self, $inversion) = @_;

        while (my $token = $inversion->next) {
            my $text = $token->get_text;

            # Transform $text here

            $token->set_text($text);
        }

        $inversion->reset;
        return $inversion;
    }

    sub equals {
        return 1;
    }

    1;

For a proper implementation, you should also provide "dump" and "load"
methods and a real "equals" method, but they're not really necessary for
a one-off job. Just remember to always reindex after changing the
parameters of your custom analyzer. Without "dump", "load" and "equals"
you won't get an error message in this case.

Your custom analyzer should then stem the words that are not in your
stoplist one by one using the (undocumented) "split" method. So the
"transform text" part of your analyzer would look like:


    if (!$stoplist->{$text}) {
        my $tokens = $stemmer->split($text);
        $text = $tokens->[0];
    }

Also note that you have to store member variables like "stoplist" and
"stemmer" of your analyzer class using the "inside-out" approach (one
global hash per variable). You'll find some example code showing how to
do that in the threads I mentioned above.
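
To make the "inside-out" idea concrete, here is a rough standalone sketch
that runs without Lucy. Everything in it is made up for illustration: the
package name My::StoplistDemo, the stem_unless_stopped method, and the toy
stemmer (a code ref that only strips a trailing "s") standing in for the
real Snowball stemmer. It also keys the hashes with refaddr, which is an
assumption of this sketch; the threads above show the form used with Lucy
objects.

```perl
package My::StoplistDemo;
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# One file-scoped hash per member variable ("inside-out" storage),
# keyed by the object's address instead of living inside the object.
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;

    # Stand-in for $class->SUPER::new(%args) in a real analyzer.
    my $self = bless \(my $anon), $class;

    $stoplist{ refaddr($self) } = { map { $_ => 1 } @{ $args{stoplist} } };
    $stemmer{ refaddr($self) }  = $args{stemmer};
    return $self;
}

# Return $text unchanged if it is on the stoplist, otherwise stem it.
sub stem_unless_stopped {
    my ($self, $text) = @_;
    my $id = refaddr($self);
    return $text if $stoplist{$id}{$text};
    return $stemmer{$id}->($text);
}

# Inside-out data is not freed with the object, so remove it explicitly.
sub DESTROY {
    my $self = shift;
    my $id   = refaddr($self);
    delete $stoplist{$id};
    delete $stemmer{$id};
}

package main;

# Hypothetical toy stemmer: strips a single trailing "s" and nothing else.
my $toy_stemmer = sub { my $t = shift; $t =~ s/s\z//; return $t };

my $analyzer = My::StoplistDemo->new(
    stoplist => [qw(DNS GPS)],
    stemmer  => $toy_stemmer,
);

print join(' ',
    map { $analyzer->stem_unless_stopped($_) } qw(servers DNS GPS cables)),
    "\n";
```

The stoplist lookup is an exact, case-sensitive hash lookup, so the entries
have to match the token text as it arrives at your analyzer.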


Nick
