If I broke something, let me know so I can restore the original
functionality, even if that takes some minor changes to incorporate.

I'll go ahead and add a variable to the dictionary that stores the
longest token count, just in case I go that route.  The Index was a good
idea, but it doesn't carry over the case sensitivity to its members that
are in the dictionary.
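
For reference, here is a rough sketch of the longest-token-count route.
It only illustrates the idea under my own assumptions: TokenDictionary
and LongestMatchFinder are hypothetical names, not the real Dictionary or
DictionaryNameFinder classes, and the lookup here is plainly
case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the dictionary; NOT the OpenNLP Dictionary class.
class TokenDictionary {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxTokenCount = 0; // length, in tokens, of the longest entry added so far

    void put(String... tokens) {
        entries.add(Arrays.asList(tokens));
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    boolean contains(List<String> tokens) {
        return entries.contains(tokens);
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }
}

// Hypothetical finder; a sketch of the idea, not the real find() method.
class LongestMatchFinder {
    // Returns {start, end} pairs (end exclusive), preferring the longest match at each position.
    static List<int[]> find(TokenDictionary dict, String[] tokens) {
        List<int[]> spans = new ArrayList<>();
        List<String> sentence = Arrays.asList(tokens);
        int start = 0;
        while (start < tokens.length) {
            // Never try a window longer than the longest dictionary entry.
            int longest = Math.min(dict.getMaxTokenCount(), tokens.length - start);
            int matchedEnd = -1;
            for (int len = longest; len >= 1; len--) {
                if (dict.contains(sentence.subList(start, start + len))) {
                    matchedEnd = start + len;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {start, matchedEnd});
                start = matchedEnd; // continue after the matched name
            } else {
                start++;
            }
        }
        return spans;
    }
}
```

Each position costs at most maxTokenCount lookups, so the work grows with
the sentence length times the longest entry rather than with the square
of the sentence length.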

James

On 3/15/2012 8:06 PM, Jim - FooBar(); wrote:
> aaaaa ok i see what you mean! that makes perfect sense - thanks for
> being so thorough in explaining...
>
> I will check first thing tomorrow morning to see which one was the
> version that was returning multi-word entities properly...In fact i
> specifically remember getting back both "folic acid" and "valproic
> acid" in the little paragraph i posted....anyway i'll let you know how
> i get on...
>
> Jim
>
>
>
> On 15/03/12 23:48, James Kosin wrote:
>> Jim,
>>
>> The hashcode is used to look up and compare items more quickly in Java.
>> Basically, if the hashcodes match then Java knows there is a strong
>> possibility the two entries are the same.  The downside is that the
>> hashcode for a dictionary entry is based on the entire set of tokens
>> in the entry.  This means Java won't even try comparing two items if
>> their hashcodes differ.  It is a commonly used optimization.
>>
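
To make the hashcode point concrete, here is a small standalone sketch.
It uses plain java.util lists of tokens in a HashSet as a stand-in for
dictionary entries; that is my assumption for illustration and not how
the Dictionary stores its entries internally. HashLookupDemo is a
made-up name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashLookupDemo {
    // Builds a hash-based set holding a single two-token entry.
    static Set<List<String>> buildEntries() {
        Set<List<String>> entries = new HashSet<>();
        entries.add(Arrays.asList("folic", "acid"));
        return entries;
    }

    public static void main(String[] args) {
        Set<List<String>> entries = buildEntries();
        List<String> prefix = Arrays.asList("folic");
        List<String> full = Arrays.asList("folic", "acid");

        // The list hashcode is computed from every token, so the prefix hashes differently.
        System.out.println("same hashcode: " + (prefix.hashCode() == full.hashCode()));
        // A lookup with only the first token never reaches the two-token entry.
        System.out.println("contains prefix: " + entries.contains(prefix));
        // A lookup with the complete token list does find it.
        System.out.println("contains full: " + entries.contains(full));
    }
}
```

So a finder that asks about {"folic"} first and only extends the window
on a hit will stop before it ever tries {"folic", "acid"}.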
>> In 1.5.3 we fixed a few more issues with the Dictionary to properly
>> handle the words and case sensitivity.  I also made some changes to take
>> out a small section in the DictionaryNameFinder's find() method that
>> used an Index created to determine if we should look and add another
>> word.  I may have re-factored this wrong and need to come up with a
>> better solution.
>>
>> We have several possibilities to fix and address this issue.  However,
>> some of them involve possibly making this an N^2 problem again for the
>> code.  I'm trying to avoid that and fix the problem correctly.  Maybe I
>> shouldn't have used hashcode so freely, but, it was how I found the
>> problem.  The hashcode for {"folic", "acid"} is different than that for
>> {"folic"}... so, the Dictionary doesn't bother comparing the two.  One
>> possibility is to have the entry for {"folic", "acid"} and {"folic"} be
>> the same... the only drawback is we lose resolution in finding specific
>> names.
>> Another possible solution would be to keep a max_token_count for the
>> Dictionary to represent the number of tokens that the
>> DictionaryNameFinder would try to put together in the find() method...
>> limiting the greediness to the longest token-list in the dictionary.
>>
>> Could you check with 1.5.2 to see whether you can find multi-word
>> entries, with or without case sensitivity?  If you can, that narrows
>> the problem down to the changes I made in the trunk.
>>
>> Thanks,
>> James
>>
>> On 3/15/2012 5:39 AM, Jim - FooBar(); wrote:
>>> So the problem is all in the hashcode.............
>>>
>>> Does that relate to the question i posted yesterday? I'm a bit
>>> confused...How is the .hashCode() related with not finding multi-word
>>> entities? and also, what happened between versions 1.5.2 & 1.5.3
>>> snapshot cos i do remember being able to find multi-word entities at
>>> some point (i think with 1.5.2)...
>>>
>>> Jim
>>>
>>>
>
