On 12/30/12 7:12 PM, Peter Karman wrote:
> Aleksandar Radovanovic wrote on 12/30/12 5:21 AM:
>
>> Thank you, Marvin, I tried what you suggested! It works fine, but my
>> main problem still remains: how to find and index *predefined* phrases.
>> In your example this boils down to the implementation of
>> extract_chem_names($content).
>>
>> I was hoping to use some Lucy functionality for this - indexing the
>> whole text, searching the index for the predefined phrases, and indexing
>> them separately. But this does not work correctly for biomedical
>> documents, where the text often looks like a random sequence of odd
>> characters and strange, non-language words that Lucy simply skips or
>> stems incorrectly.
>>
>> So, the core of my idea is to have the opposite of stopwords: a list of
>> phrases which will be indexed without the stemmer - exactly as they
>> appear in the user-supplied list. I was wondering why such a simple and
>> obvious feature has not been implemented - or am I missing something?
>>
> You're missing something. Stopword filtering happens *after* tokenizing in the
> analysis chain; so too would your Goword filter. It's the tokenizing that's
> problematic.
>
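> To make the ordering concrete: a custom analysis chain in Lucy is just a
> PolyAnalyzer wrapping an ordered list of analyzers. Untested, and the
> exact mix of filters is up to you, but it would look something like:
>
> my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
>     analyzers => [
>         Lucy::Analysis::RegexTokenizer->new,    # tokenize first
>         Lucy::Analysis::Normalizer->new,        # case-fold / normalize
>         Lucy::Analysis::SnowballStopFilter->new( language => 'en' ),
>         Lucy::Analysis::SnowballStemmer->new( language => 'en' ),
>     ],
> );
>
> A GoWord filter would sit in that list somewhere after the tokenizer,
> which is why it can't rescue phrases the tokenizer has already mangled.
>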
> The problem isn't the lack of a GoWordFilter, it's the lack of a
> ChemTokenizer: how to tokenize a block of text that contains *both*
> chemical strings and narrative strings. It's like trying to apply an
> English stemmer to a text that contains both English and French. The
> problem is: how to apply the rules for one grammar against a text that
> contains mixed grammars that use the same alphabet. Writing a single
> regex is practically impossible.
>
> If you just wanted to pull out the chemical strings from your text, and
> ignore everything else, that would be a fairly straightforward task. If
> you wanted to ignore all the chemical strings, that too would be
> straightforward (that's basically what happens by default). But you seem
> to want to combine them. That's not simple or straightforward.
>
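> You can see the "ignored by default" part for yourself: the stock
> RegexTokenizer pattern keeps little more than word characters, so a
> chemical string gets shredded into ordinary-looking fragments. A quick,
> untested illustration:
>
> my $tokenizer = Lucy::Analysis::RegexTokenizer->new;   # default pattern
> my $tokens    = $tokenizer->split('[Hg(CN)2] dissolves slowly');
> # $tokens comes back as something like
> #     [ 'Hg', 'CN', '2', 'dissolves', 'slowly' ]
> # so the bracketed formula is gone before any filter ever sees it.
>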
> Marvin's suggestion tries to address the complexity you're after. If
> what you're missing is an implementation of extract_chem_names(), that
> seems like a suitable exercise for you to undertake, since that requires
> domain-specific knowledge. I might start with something naive like:
>
> my @chem_names = (
>     'NH4+/H+K+/NH4+(H+)',
>     '[Hg(CN)2]',
>     'Ca(.-)',
> );
>
> # Return a ref to the array of known names that appear verbatim in $text.
> sub extract_chem_names {
>     my $text = shift;
>     my @matches;
>     for my $n (@chem_names) {
>         my $esc = quotemeta($n);    # escape metacharacters like + ( ) [ ]
>         if ($text =~ m/$esc/) {
>             push @matches, $n;
>         }
>     }
>     return \@matches;
> }
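>
> You would then feed the result into its own field with a non-stemming
> analyzer, so the names survive verbatim. Untested sketch; the field
> names and the whitespace-only tokenizer pattern are just placeholders
> (it assumes your names contain no spaces):
>
> use Lucy::Plan::Schema;
> use Lucy::Plan::FullTextType;
> use Lucy::Analysis::PolyAnalyzer;
> use Lucy::Analysis::RegexTokenizer;
> use Lucy::Index::Indexer;
>
> my $schema = Lucy::Plan::Schema->new;
>
> # ordinary stemmed full-text field for the narrative text
> my $en_type = Lucy::Plan::FullTextType->new(
>     analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
> );
> $schema->spec_field( name => 'content', type => $en_type );
>
> # separate field whose tokens are kept exactly as extracted
> my $chem_type = Lucy::Plan::FullTextType->new(
>     analyzer => Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' ),
> );
> $schema->spec_field( name => 'chem_names', type => $chem_type );
>
> my $indexer = Lucy::Index::Indexer->new(
>     schema => $schema,
>     index  => '/path/to/index',
>     create => 1,
> );
> $indexer->add_doc({
>     content    => $content,
>     chem_names => join( ' ', @{ extract_chem_names($content) } ),
> });
> $indexer->commit;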
>
I see it clearly now. To express it in Lucy syntax, I would need some
expanded PolyAnalyzer:
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    dictionaries => [ $chemicals, $genes, $human_anatomy ],
    language     => 'en',
);
Since such magic does not (yet :-) exist, I'll follow your advice.
Marvin, Peter, thank you so much for all your help!
Regards, Alex