On 10/12/12 5:10 PM, Nick Wellnhofer wrote:
>
> On 12/10/2012 15:27, Aleksandar Radovanovic wrote:
>> Thank you, Nick. Could you possibly give me some more specific
>> guidelines?
>>
>> At the moment, all indexed words are "flat" with no semantics, which is
>> great for general purposes. However, if one focuses on, let's say,
>> biomedical literature, one would like to distinguish which words
>> represent gene names, drug names, etc. A user would then be able to
>> compose a search like "[drug_dictionary_ID] AND headache" to get
>> documents containing all drug names related to headache.
>
> First, create a schema with two full-text fields: one named "text" for
> the document content, and another named "dict" for dictionary IDs.
> Then, before indexing a document, create a list of dictionary IDs
> related to that document, store the IDs in the "dict" field separated
> by whitespace, and index the document.
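>
> A rough sketch of the indexing side (untested; the analyzer choices,
> index path, and per-document data are just placeholders):
>
>     use Lucy::Plan::Schema;
>     use Lucy::Plan::FullTextType;
>     use Lucy::Analysis::PolyAnalyzer;
>     use Lucy::Analysis::RegexTokenizer;
>     use Lucy::Index::Indexer;
>
>     my $schema = Lucy::Plan::Schema->new;
>
>     # Full language analysis for the document content.
>     my $text_type = Lucy::Plan::FullTextType->new(
>         analyzer => Lucy::Analysis::PolyAnalyzer->new( language => 'en' ),
>     );
>
>     # Plain whitespace tokenization is enough for the dictionary IDs.
>     my $dict_type = Lucy::Plan::FullTextType->new(
>         analyzer => Lucy::Analysis::RegexTokenizer->new( pattern => '\S+' ),
>     );
>
>     $schema->spec_field( name => 'text', type => $text_type );
>     $schema->spec_field( name => 'dict', type => $dict_type );
>
>     my $indexer = Lucy::Index::Indexer->new(
>         schema => $schema,
>         index  => '/path/to/index',
>         create => 1,
>     );
>
>     my $content  = '...';                      # the document text
>     my @dict_ids = qw( drug_dict gene_dict );  # IDs computed for this doc
>     $indexer->add_doc({
>         text => $content,
>         dict => join( ' ', @dict_ids ),
>     });
>     $indexer->commit;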
>
> For the search part, you can write your own query parser, or use the
> excellent Search::Query module, which supports the "field:value"
> syntax. Something like this should work:
>
>     my $parser = Search::Query->parser(
>         dialect       => 'Lucy',
>         default_field => 'text',
>     );
>     my $query      = $parser->parse('dict:drug_dictionary_ID AND headache');
>     my $lucy_query = $query->as_lucy_query();
>     my $hits       = $lucy_searcher->hits( query => $lucy_query );
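>
> You can then iterate over the hits as usual:
>
>     while ( my $hit = $hits->next ) {
>         print "$hit->{title}\n";    # assumes a stored "title" field
>     }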
>
> Hope this helps,
>
> Nick
>
Great! Thank you, Nick. I'll try to implement your suggestion.
I just see one computation-time problem: creating the string of
dictionary IDs that will accompany every document. I want to use the
PubMed corpus, which has more than 23,000,000 documents, and my ten
biomedical dictionaries have more than 500,000 terms. Also, the
dictionaries often contain phrases, homonyms that belong to multiple
dictionaries, synonyms, etc. I guess I will need an additional
Lucy-style, super-fast module to accomplish this task. Or perhaps I am
trying to use Lucy for something she is not designed for :-)
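For what it's worth, here is a rough sketch of the kind of lookup I have
in mind (the term-to-dictionary map and the phrase-length limit are made
up for illustration):

    use strict;
    use warnings;

    # Hypothetical map from a normalized dictionary term (possibly a
    # multi-word phrase) to the IDs of the dictionaries containing it.
    my %term_to_ids = (
        'aspirin'              => ['drug_dict'],
        'acetylsalicylic acid' => ['drug_dict'],
        'brca1'                => ['gene_dict'],
    );
    my $max_words = 3;    # longest phrase in any dictionary

    # Return the whitespace-separated "dict" field value for a document.
    sub dict_ids_for {
        my ($text) = @_;
        my @tokens = split /\W+/, lc $text;
        my %seen;
        for my $i ( 0 .. $#tokens ) {
            for my $len ( 1 .. $max_words ) {
                last if $i + $len - 1 > $#tokens;
                my $phrase = join ' ', @tokens[ $i .. $i + $len - 1 ];
                if ( my $ids = $term_to_ids{$phrase} ) {
                    # A homonym may belong to several dictionaries.
                    $seen{$_} = 1 for @$ids;
                }
            }
        }
        return join ' ', sort keys %seen;
    }

    print dict_ids_for('Aspirin may relieve a headache.'), "\n";  # "drug_dict"

Each document would cost on the order of (tokens x max phrase length)
hash lookups, so I hope this stays fast enough even for 23,000,000
abstracts.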
Thank you again,
Alex