Jim,

The format is wrong: each word of a multi-word name must be in its own
<token> element. We already asked you to try using the DictionaryBuilder
tool:

input.txt:
--------
Lepirudin
Cetuximab
Dornase Alfa
Denileukin diftitox
Etanercept
Bivalirudin
Leuprolide
Peginterferon alfa-2a
Alteplase
--------

command:

bin/opennlp DictionaryBuilder -inputFile input.txt -outputFile output.xml \
  -encoding <encoding of inputFile>

output.xml:
------
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="false">
<entry>
<token>Etanercept</token>
</entry>
<entry>
<token>Dornase</token>
<token>Alfa</token>
</entry>
<entry>
<token>Peginterferon</token>
<token>alfa-2a</token>
</entry>
<entry>
<token>Alteplase</token>
</entry>
<entry>
<token>Leuprolide</token>
</entry>
<entry>
<token>Denileukin</token>
<token>diftitox</token>
</entry>
<entry>
<token>Bivalirudin</token>
</entry>
<entry>
<token>Cetuximab</token>
</entry>
<entry>
<token>Lepirudin</token>
</entry>
</dictionary>
------
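
To make the difference from your file concrete, here is a rough sketch in
plain Java of the conversion from one line of input.txt to one <entry>. It is
not the DictionaryBuilder source: the class and method names are made up for
illustration, and splitting on whitespace is my assumption about how names
are tokenized. It also keeps the input order, which the real tool evidently
does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of OpenNLP.
public class DictionaryEntrySketch {

    // Escape the characters that are special in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // One input line becomes one <entry> with one <token> per
    // whitespace-separated word (assumed tokenization).
    static String toEntry(String line) {
        StringBuilder sb = new StringBuilder("<entry>\n");
        for (String token : line.trim().split("\\s+")) {
            sb.append("<token>").append(escapeXml(token)).append("</token>\n");
        }
        return sb.append("</entry>\n").toString();
    }

    // Wrap all non-blank lines in the <dictionary> root element.
    static String toDictionaryXml(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<dictionary case_sensitive=\"false\">\n");
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                sb.append(toEntry(line));
            }
        }
        return sb.append("</dictionary>\n").toString();
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Cetuximab");
        lines.add("Dornase Alfa");
        System.out.print(toDictionaryXml(lines));
    }
}
```

The point is only that "Dornase Alfa" ends up as two <token> elements inside
a single <entry>, not as one <token> containing a space.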

Regards,
William

On Fri, Feb 24, 2012 at 8:38 AM, Jim - FooBar(); <[email protected]> wrote:

> On 24/02/12 05:09, James Kosin wrote:
>
>> Jim,
>>
>> Maybe the problem is how you have created the dictionary.  The
>> DictionaryNameFinder's find() method is a greedy method that will match
>> as many tokens as possible.
>> If it isn't matching more than one token, then that is probably all the
>> dictionary contains per entry.
>>
>> Look at DictionaryNameFinderTest.java in the opennlp.tools.namefind test
>> package of the source distribution. It has a good, simple example.
>>
>> James
>>
>
> Hi James,
>
> Well, the dictionary i created manually...basically i extracted all the
> drug-names from drugbank.xml and wrote them to a txt file (one entry per
> line). then i processed that text-file in order to produce the xml version
> of the proper dictionary. What i have after doing all that is a file with
> contents of the type:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <dictionary case_sensitive="false">
> <entry><token>Lepirudin</token></entry>
> <entry><token>Cetuximab</token></entry>
> <entry><token>Dornase Alfa</token></entry>
> <entry><token>Denileukin diftitox</token></entry>
> <entry><token>Etanercept</token></entry>
> <entry><token>Bivalirudin</token></entry>
> <entry><token>Leuprolide</token></entry>
> <entry><token>Peginterferon alfa-2a</token></entry>
> <entry><token>Alteplase</token></entry>
> ......
> ......
> ......etc etc
>
> As you can see some drugs are multi-word entities and also the first
> character of each word is capitalized. Whenever i call the find() method
> all i'm getting are the exact matches, which means that case-insensitive
> matching does not work either!!! For example i'm getting "Cetuximab" but not
> "cetuximab"...so the problem is twofold...Firstly and more importantly I
> cannot find multi-word entities even though they do exist in the dictionary
> and the test data. Secondly, even though i'm setting case_sensitive="false"
> in both the xml file and the constructor of the DictionaryNameFinder, the
> actual results that i'm getting are always case-sensitive!!!
>
> Can you see any problems with the xml file?
>
> Jim
>
>
