Thanks Nick,

I was secretly hoping that there was built-in functionality that does this. Unfortunately, the solution is more complex. I'll look into it.

Best
NT

On Thu, Jan 17, 2013 at 2:48 PM, Nick Wellnhofer <[email protected]> wrote:
> On 17/01/2013 10:21, Nikola Tulechki wrote:
>>
>> Hello
>>
>> I am using lucy on a technical documentation and I have a bunch of
>> acronyms that must not be stemmed.
>> Is there a way to add a stoplist to the stemmer so it skips some terms?
>
>
> It can be done, but it's not trivial and probably not very performant.
> First, you have to write your own Analyzer class in Perl. See the following
> threads for some guidance:
>
> http://mail-archives.apache.org/mod_mbox/lucy-user/201111.mbox/%[email protected]%3E
> http://mail-archives.apache.org/mod_mbox/lucy-user/201207.mbox/%[email protected]%3E
>
> We really need a cookbook entry describing how to write custom analyzers.
> But to get started, here is some minimal skeleton code that I have used in
> the past:
>
>     package My::Custom::Analyzer;
>     use strict;
>
>     use base qw(Lucy::Analysis::Analyzer);
>
>     sub new {
>         my ($class, %args) = @_;
>         my $self = $class->SUPER::new(%args);
>
>         # Setup your analyzer here
>
>         return $self;
>     }
>
>     sub transform {
>         my ($self, $inversion) = @_;
>
>         while (my $token = $inversion->next) {
>             my $text = $token->get_text;
>
>             # Transform $text here
>
>             $token->set_text($text);
>         }
>
>         $inversion->reset;
>         return $inversion;
>     }
>
>     sub equals {
>         return 1;
>     }
>
>     1;
>
> For a proper implementation, you should also provide "dump" and "load"
> methods and a real "equals" method but they're not really necessary for a
> one-off job. Only remember to always reindex after changing the parameters
> of your custom analyzer. Without "dump", "load" and "equals" you won't get
> an error message in this case.
>
> Your custom analyzer should then stem the words that are not in your
> stoplist one by one using the (undocumented) "split" method. So the
> "transform text" part of your analyzer would look like:
>
>     if (!$stoplist->{$text}) {
>         my $tokens = $stemmer->split($text);
>         $text = $tokens->[0];
>     }
>
> Also note that you have to store member variables like "stoplist" and
> "stemmer" of your analyzer class using the "inside-out" approach (one global
> hash per variable). You'll find some example code showing how to do that in
> the threads I mentioned above.
>
> Nick
>
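
For reference, here is a minimal sketch of the "inside-out" storage and the stoplist check described above. It runs without Lucy, so several things are assumptions made only for illustration: the class name My::Demo::StoplistAnalyzer, the transform_text method, the use of refaddr as the hash key, and the toy stemmer that just strips a trailing "s". In a real analyzer the class would inherit from Lucy::Analysis::Analyzer, the object would come from SUPER::new, and the stemming step would go through the stemmer's "split" method as in Nick's snippet.

```perl
package My::Demo::StoplistAnalyzer;    # hypothetical name, not part of Lucy
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# "Inside-out" storage: one global hash per member variable,
# keyed by an identifier of the object (here: its address).
my %stoplist;
my %stemmer;

sub new {
    my ($class, %args) = @_;

    # Stand-in for $class->SUPER::new(%args): an opaque blessed scalar ref,
    # i.e. an object that cannot hold fields itself.
    my $anon;
    my $self = bless \$anon, $class;

    my $id = refaddr($self);
    $stoplist{$id} = { map { $_ => 1 } @{ $args{stoplist} || [] } };
    $stemmer{$id}  = $args{stemmer};    # assumption: a coderef, not a Lucy stemmer
    return $self;
}

# Corresponds to the "Transform $text here" step of the skeleton.
sub transform_text {
    my ($self, $text) = @_;
    my $id = refaddr($self);

    # Terms in the stoplist are passed through unstemmed.
    return $text if $stoplist{$id}{$text};
    return $stemmer{$id}->($text);
}

# Clean up the global hashes when the object goes away.
sub DESTROY {
    my $self = shift;
    my $id   = refaddr($self);
    delete $stoplist{$id};
    delete $stemmer{$id};
}

package main;

# Toy stemmer for the demo only: strips one trailing "s".
my $toy_stemmer = sub { my $word = shift; $word =~ s/s\z//; return $word };

my $analyzer = My::Demo::StoplistAnalyzer->new(
    stoplist => [qw(DNS GPS)],
    stemmer  => $toy_stemmer,
);

print join(' ', map { $analyzer->transform_text($_) } qw(servers DNS routers GPS)), "\n";
# prints "server DNS router GPS"
```

The stoplist is matched exactly here, so if another analyzer changes the case of the tokens earlier in the chain, the stoplist entries would presumably need the same treatment.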
