On 12/30/12 4:22 AM, Marvin Humphrey wrote:
> On Sat, Dec 29, 2012 at 7:22 AM, Aleksandar Radovanovic
> <[email protected]> wrote:
>> I was wondering, would it be possible to add a new feature to the
>> indexing engine (or somehow simulate it) that will do EXACTLY the
>> opposite of Lucy::Analysis::SnowballStopFilter? In other words, instead
>> of blocking a list of stopwords, the indexing engine would index ONLY
>> the phrases supplied in the user list, as exact matches. Or even better,
>> prioritize them for indexing: index the user list first and then use the
>> Lucy analyzer for words that are not in the list.
>>
>> Why can this be useful? In chemistry, for example, it is simply
>> impossible to create a rule that will index chemical names correctly
>> (e.g. NH4+/H+K+/NH4+(H+), [Hg(CN)2], Ca(.-), just to name a few of
>> thousands). Also, in biomedical text some seemingly common words can,
>> for example, represent a gene or protein name which should not be
>> stemmed. To summarize, this feature would allow one to create a correct
>> index of specialized terms.
>
> I think you could achieve this now by extracting the list of terms yourself
> prior to indexing and using a custom RegexTokenizer.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(pattern => '\\S+');
>     my $type = Lucy::Plan::FullTextType->new(analyzer => $tokenizer);
>     $schema->spec_field(name => 'chemicals', type => $type);
>
>     ...
>
>     my @chemical_names = extract_chem_names($content);
>     my $chem_content = join(' ', @chemical_names);
>     $indexer->add_doc({
>         content   => $content,
>         chemicals => $chem_content,
>         ...
>     });
>
> If the chemical names may contain whitespace, I'd suggest using "\x1F", the
> ASCII "unit separator", as a delimiter.
>
>     my $tokenizer = Lucy::Analysis::RegexTokenizer->new(
>         pattern => '[^\\x1F]+',
>     );
>
>     ...
>
>     my $chem_content = join("\x1F", @chemical_names);
>
> At search-time, you'd need to duplicate the transform and feed the content
> to an extra QueryParser.
>
>     my $main_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>     );
>     my $chem_parser = Lucy::Search::QueryParser->new(
>         schema => $searcher->get_schema,
>         fields => ['chemicals'],
>     );
>     my $main_query = $main_parser->parse($query_string);
>     my $chem_query = $chem_parser->parse(join(' ', extract_chem_names($query_string)));
>     my $or_query   = Lucy::Search::ORQuery->new(
>         children => [$main_query, $chem_query],
>     );
>     my $hits = $searcher->hits(query => $or_query);
>     ...
>
> The tutorial documentation in Lucy::Docs::Tutorial::QueryObjects may give
> you some ideas as well.
>
> Cheers,
>
> Marvin Humphrey

Thank you Marvin, I tried what you suggested! It works fine, but my main
problem still remains: how to find and index *predefined* phrases. In your
example this boils down to the implementation of extract_chem_names($content).
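
(For concreteness, a minimal sketch of what a hand-rolled extract_chem_names()
might look like, assuming the phrase list is fixed and kept in a plain text
file, one phrase per line. The file name and the brute-force matching strategy
are only illustrative.)

    use strict;
    use warnings;

    # Load the user-supplied phrase list; the file name is only illustrative.
    open my $fh, '<', 'chem_phrases.txt' or die "chem_phrases.txt: $!";
    chomp(my @phrases = <$fh>);
    close $fh;

    # Try longer phrases first so that e.g. "NH4+(H+)" wins over "NH4+".
    @phrases = sort { length($b) <=> length($a) } grep { length } @phrases;

    # One big alternation; quotemeta() protects '(', '[', '+', '.' and the
    # other punctuation that chemical names are full of.
    my $phrase_re = join '|', map { quotemeta } @phrases;
    $phrase_re = qr/$phrase_re/;

    sub extract_chem_names {
        my ($text) = @_;
        # Return every occurrence, in document order.
        return $text =~ /($phrase_re)/g;
    }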
I was hoping to use some Lucy functionality for this: index the whole text,
search the index for the predefined phrases, and then index them separately.
But this does not work correctly for biomedical documents, where the text
often looks like a random sequence of odd characters and strange, non-language
words which Lucy simply skips or stems incorrectly.

So, the core of my idea is to have something opposite to stopwords: a list of
phrases which will be indexed without the stemmer, exactly as they appear in
the user-supplied list. I was wondering why such a simple and obvious feature
has not been implemented - or am I missing something?

Alex
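
(For reference, a minimal sketch of the "index the list first, analyze the
rest" idea on top of the two-field scheme from Marvin's example. It reuses the
$phrase_re and extract_chem_names() sketch above and strips the listed phrases
out of the text that goes through the normal analyzer; all names are
illustrative, not an actual Lucy feature.)

    # Listed phrases are indexed verbatim in 'chemicals'; everything else
    # goes through the normal analyzer via 'content'.
    my @chemical_names = extract_chem_names($content);

    # Blank the listed phrases out of the copy that gets stemmed.
    (my $rest = $content) =~ s/$phrase_re/ /g;

    $indexer->add_doc({
        content   => $rest,
        chemicals => join("\x1F", @chemical_names),
    });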
