On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
<[email protected]> wrote:
> I was wondering, would it be possible to add a new feature to the
> indexing engine (or somehow simulate it) that will do the EXACT opposite
> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
> blocking a list of stopwords, the indexing engine would index ONLY the
> phrases supplied in the user list, as exact matches. Or even better,
> prioritize them for indexing: index the user list first and then use the
> Lucy analyzer for words that are not in the list.
>
> Why would this be useful? In chemistry, for example, it is simply
> impossible to create a rule that will index chemical names correctly
> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), to name just a few of
> thousands). Also, in biomedical text some seemingly common words can,
> for example, represent a gene or protein name that should not be
> stemmed. To summarize, this feature would allow one to create correct
> indexes of specialized terms.

I think you could achieve this now by extracting the list of terms yourself
prior to indexing and using a custom RegexTokenizer.

    # Treat each whitespace-delimited chunk as one token; no stemming.
    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
    my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
    $schema->spec_field(name => 'chemicals', type => $type);
    ...
    my @chemical_names = extract_chem_names($content);
    my $chem_content = join(' ', @chemical_names);
    $indexer->add_doc({
        content   => $content,
        chemicals => $chem_content,
        ...
    });
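
The extract_chem_names() routine above is something you would have to write
yourself. As a rough sketch only, assuming the names come from a fixed
user-supplied list (@known_terms and the matching strategy here are
hypothetical, not part of Lucy), it could look something like this:

```perl
use strict;
use warnings;

# Hypothetical user-supplied list of terms to index verbatim.
my @known_terms = ('NH4+', '[Hg(CN)2]', 'Ca(.-)', 'H+');

# Build one alternation, longest terms first, so that 'NH4+' wins over 'H+'.
# quotemeta() escapes regex metacharacters such as '+', '(' and '['.
my $alternation = join '|',
    map { quotemeta($_) }
    sort { length($b) <=> length($a) } @known_terms;
my $term_regex = qr/($alternation)/;

# Return every known term found in the text, in order of appearance.
sub extract_chem_names {
    my ($text) = @_;
    return $text =~ /$term_regex/g;
}

my @found = extract_chem_names('Add NH4+ to [Hg(CN)2] and stir.');
print join(', ', @found), "\n";    # prints: NH4+, [Hg(CN)2]
```

Note that this matches terms anywhere in the text, including inside longer
words, so a real implementation would probably want some boundary checks.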

If the chemical names may contain whitespace, I'd suggest using "\x1F", the
ASCII "unit separator", as a delimiter.

    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[^\\x1F]+'
    );
    ...
    my $chem_content = join("\x1F", @chemical_names);
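
To see why the unit separator helps, here is a small stand-alone sketch that
applies the same pattern with a plain Perl regex instead of going through
Lucy. The sample names are made up for illustration, and the regex match only
approximates what RegexTokenizer does with that pattern:

```perl
use strict;
use warnings;

# Illustrative names, two of which contain whitespace.
my @chemical_names = ('sodium chloride', 'NH4+', 'acetic acid');

# Join with the ASCII unit separator, as the field value would be built.
my $chem_content = join("\x1F", @chemical_names);

# Each maximal run of non-separator characters becomes one token.
my @tokens = $chem_content =~ /[^\x1F]+/g;

print "$_\n" for @tokens;
```

Each name comes back as a single token with its internal whitespace intact,
which a '\S+' pattern would not give you.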

At search-time, you'd need to duplicate the transform and feed the content to
an extra QueryParser.

    my $main_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
    );
    my $chem_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['chemicals'],
    );
    # Apply the same extraction and join that were used at index time;
    # parse() expects a single query string, not a list.
    my $chem_string = join(' ', extract_chem_names($query_string));
    my $main_query = $main_parser->parse($query_string);
    my $chem_query = $chem_parser->parse($chem_string);
    my $or_query = Lucy::Search::ORQuery->new(
        children => [$main_query, $chem_query],
    );
    my $hits = $searcher->hits(query => $or_query);
    ...

The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
some ideas as well.

Cheers,

Marvin Humphrey