Aleksandar Radovanovic wrote on 12/30/12 5:21 AM:
> Thank you Marvin, I tried what you have suggested! It works fine, but my
> main problem still remains: how to find and index *predefined* phrases.
> In your example this boils down to the implementation of
> /extract_chem_names($content). /
>
> I was hoping to use some Lucy functionality for this - indexing the
> whole text, searching the index for predefined phrases and index them
> separately. But this does not work correctly for biomedical documents in
> which text often looks like random sequence of weird characters, and
> strange, no-language words which Lucy simply skips, or stems incorrectly.
>
> So, the core of my idea is to have something opposite to stopwords. A
> list of phrases which will be indexed without stemmer - exactly as they
> appear in the user supplied list. I was wondering why such a simple and
> obvious feature was not implemented - or am I missing something?
>
You're missing something. Stopword filtering happens *after* tokenizing in the
analysis chain, and so would your GoWordFilter. It's the tokenizing that's
problematic.

The problem isn't the lack of a GoWordFilter, it's the lack of a ChemTokenizer:
how to tokenize a block of text that contains *both* chemical strings and
narrative strings. It's like trying to apply an English stemmer to a text that
contains both English and French. The question is how to apply the rules for one
grammar against a text that mixes grammars sharing the same alphabet.
Writing a single regex that handles both is practically impossible.
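
To make the ordering issue concrete, here is a minimal sketch in plain Perl.
It does not use Lucy at all: naive_tokenize() and goword_filter() are
hypothetical stand-ins, and the \w+ pattern is only an assumed approximation
of a typical word tokenizer. The point is that by the time a filter runs, the
chemical name no longer exists as a single token.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a word tokenizer: runs of word characters.
# Assumption: this only approximates a real tokenizer; it is not Lucy's.
sub naive_tokenize {
    my $text = shift;
    return [ $text =~ /(\w+)/g ];
}

# Hypothetical "goword" filter that runs after tokenizing: keeps only
# tokens that appear verbatim in the user-supplied list.
sub goword_filter {
    my ( $tokens, $gowords ) = @_;
    my %keep = map { $_ => 1 } @$gowords;
    return [ grep { $keep{$_} } @$tokens ];
}

my $tokens = naive_tokenize('Treated with [Hg(CN)2] overnight');
# The chemical name has already been split into 'Hg', 'CN' and '2'.
print join( '|', @$tokens ), "\n";

my $kept = goword_filter( $tokens, ['[Hg(CN)2]'] );
# No single token equals '[Hg(CN)2]', so the filter keeps nothing.
print scalar(@$kept), "\n";
```

A filter placed after this kind of tokenizer can never match '[Hg(CN)2]',
however complete the goword list is.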

If you just wanted to pull out the chemical strings from your text, and ignore
everything else, that would be a fairly straightforward task. If you wanted to
ignore all the chemical strings, that too would be straightforward (that's
basically what happens by default). But you seem to want to combine them. That's not
simple or straightforward.

Marvin's suggestion tries to address the complexity you're after. If what you're
missing is an implementation of extract_chem_names(), that seems like a suitable
exercise for you to undertake, since that requires domain-specific knowledge. I
might start with something naive like:
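
# Known chemical names, matched literally.
my @chem_names = (
    'NH4+/H+K+/NH4+(H+)',
    '[Hg(CN)2]',
    'Ca(.-)',
);

# Return an arrayref of the known names that occur in $text.
sub extract_chem_names {
    my $text = shift;
    my @matches;
    for my $n (@chem_names) {
        # Escape regex metacharacters so the name matches literally.
        my $esc = quotemeta($n);
        if ($text =~ m/$esc/) {
            push @matches, $n;
        }
    }
    return \@matches;
}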
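
If you want to go one step further and handle the two kinds of string
separately, one possible direction is to pull the known names out first and
leave the narrative text behind for the normal analysis chain. The sketch below
is only an illustration under that assumption: split_chem_and_prose() is a
hypothetical helper, not part of Lucy, and it handles only literal names from a
fixed list.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lucy: split $text into the known
# chemical names it contains and the remaining narrative text.
sub split_chem_and_prose {
    my ( $text, $names ) = @_;

    # Longest names first, so a long name wins over a shorter name
    # that happens to be its prefix.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @$names;
    return ( [], $text ) unless length $alt;

    my @found;
    # Record each literal occurrence and replace it with a space.
    ( my $prose = $text ) =~ s/($alt)/push @found, $1; ' '/ge;
    return ( \@found, $prose );
}

my ( $chem, $prose ) = split_chem_and_prose(
    'Added [Hg(CN)2] to the Ca(.-) solution.',
    [ 'NH4+/H+K+/NH4+(H+)', '[Hg(CN)2]', 'Ca(.-)' ],
);
print "chem:  @$chem\n";
print "prose: $prose\n";
```

The names in $chem could then go into a field that is indexed without
stemming, while $prose goes through the usual analyzer. Whether that fits your
schema is something only you can judge.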
--
Peter Karman . http://peknet.com/ . [email protected]