Re: [lucy-user] New feature suggestion

Marvin Humphrey Sat, 29 Dec 2012 17:22:50 -0800

On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
<[email protected]> wrote:
> I was wondering, would it be possible to add a new feature  to the
> indexing engine (or somehow simulate it) that will do EXACTLY opposite
> of Lucy::Analysis::SnowballStopFilter? In other words, instead of
> blocking a list of stopwords, indexing engine will index ONLY phrases
> supplied in the user list to the exact match. Or even better, prioritize
> them for indexing: index the user list first and then use Lucy analyzer
> for words that are not in the list.
>
> Why this can be useful? In chemistry for example, it is simply
> impossible to create a rule that will index chemical names correctly (
> e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-) just to name a few of
> thousands). Also, in a biomedical text some seemingly common words can
> for example, represent a gene or protein name which should not be
> stemmed.  To summarize, this feature will allow one to create a correct
> index(es) of specialized terms.


I think you could achieve this now by extracting the list of terms yourself
prior to indexing and using a custom RegexTokenizer.

    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
    my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
    $schema->spec_field(name => 'chemicals', type => $type);

    ...

    my @chemical_names = extract_chem_names($content);
    my $chem_content = join(' ', @chemical_names);
    $indexer->add_doc({
        content   => $content,
        chemicals => $chem_content,
        ...
    });
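
Note that extract_chem_names() is not part of Lucy; it stands in for whatever
extraction routine you write yourself. Here is a minimal sketch of one way it
might look, assuming you keep your terms in a plain array. The @KNOWN_TERMS
list and the matching strategy are illustrative assumptions, not a
recommendation.

```perl
use strict;
use warnings;

# Hypothetical user-supplied list of terms that must be indexed verbatim.
my @KNOWN_TERMS = ('NH4+', '[Hg(CN)2]', 'Ca(.-)', 'H+');

# Build one alternation, longest terms first, so that a longer term is
# preferred over a shorter term that is a prefix of it.
# quotemeta() escapes regex metacharacters such as '+', '[', '(' and '.'.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @KNOWN_TERMS;
my $term_regex = qr/($alternation)/;

# Return every known term found in the text, in order of appearance.
sub extract_chem_names {
    my ($text) = @_;
    my @found;
    while ($text =~ /$term_regex/g) {
        push @found, $1;
    }
    return @found;
}

my @names = extract_chem_names('Uptake of NH4+ is inhibited by [Hg(CN)2].');
print join(' ', @names), "\n";
```

For a list of many thousands of terms, a single large alternation may get
slow, and a dictionary lookup over candidate substrings could be a better fit.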

If the chemical names may contain whitespace, I'd suggest using "\x1F", the
ASCII "unit separator", as a delimiter.

    my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
        pattern => '[^\\x1F]+'
    );

    ...

    my $chem_content = join("\x1F", @chemical_names);
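
To see why the unit separator keeps multi-word names intact, here is a small
sketch that applies the same pattern with Perl's own regex engine rather than
through Lucy. The sample names are made up for illustration; I'd expect
RegexTokenizer to split the field the same way, but check it against your
own data.

```perl
use strict;
use warnings;

# Names containing whitespace (illustrative examples only).
my @chemical_names = ('sodium chloride', 'acetic acid', 'NH4+');

# Join with the ASCII unit separator, as in the indexing code.
my $chem_content = join("\x1F", @chemical_names);

# Apply the same pattern that the RegexTokenizer is given.
my @tokens = $chem_content =~ /([^\x1F]+)/g;

# Each name comes back as one token, embedded spaces included.
print "[$_]\n" for @tokens;
```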

At search-time, you'd need to run the same extraction on the query string and
feed the result to a second QueryParser restricted to the 'chemicals' field.

    my $main_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
    );
    my $chem_parser = Lucy::Search::QueryParser->new(
        schema => $searcher->get_schema,
        fields => ['chemicals'],
    );
    my $main_query = $main_parser->parse($query_string);
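    # parse() takes a single string, so join the extracted names first.
    my $chem_string = join(' ', extract_chem_names($query_string));
    my $chem_query = $chem_parser->parse($chem_string);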
    my $or_query = Lucy::Search::ORQuery->new(
        children => [$main_query, $chem_query],
    );
    my $hits = $searcher->hits(query => $or_query);
    ...

The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give you
some ideas as well.

Cheers,

Marvin Humphrey
