Re: can't remove duplicate Annotations with Java Set Collection

Kameron Cole Tue, 18 Nov 2014 11:08:12 -0800

Awesome.  Your change will work.  And I will try it, thank you!

But maybe you can help me get this to work?   As I posted, if I use
Object as the parameter type in the compare method signature, Eclipse is OK; but
when I change it to Annotation, it says I must override the methods, as
though something about Annotation confuses Eclipse.  Here's the code I
really want to work:



-----------------------------------

public static ArrayList<Annotation> dedupe(AnnotationIndex<Annotation> idx2) {

        ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx2.size());
        FSIterator<Annotation> it2  = idx2.iterator();
        while(it2.hasNext())
        {

                tempList.add((Annotation) it2.next());

        }

        Set set = new TreeSet(new Comparator() {
                @Override
                public int compare(Annotation o1, Annotation o2) {
                        if(o1.getCoveredText()==o2.getCoveredText()){
                        return 0;
                }
                return 1;
                }
        });

        set.addAll(tempList);

        tempList.clear();
        tempList.addAll(set);
        System.out.println("templist length: "+tempList.size());
        return tempList;
}

-----------------------------

But look at what Eclipse gives me:
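
A likely cause, offered as a sketch rather than a confirmed diagnosis: "new
Comparator()" is a raw type, so the method it declares is compare(Object,
Object), and a compare(Annotation, Annotation) marked @Override overrides
nothing. Declaring Comparator<Annotation> should let the typed signature
compile. The example below uses a hypothetical TextSpan class in place of
Annotation so that it runs without UIMA. It also compares with
String.compareTo instead of ==, because == tests object identity, and a
comparator that only returns 0 or 1 does not give TreeSet a consistent ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a UIMA Annotation; only getCoveredText() matters here.
class TextSpan {
    private final String coveredText;
    private final int begin;

    TextSpan(String coveredText, int begin) {
        this.coveredText = coveredText;
        this.begin = begin;
    }

    String getCoveredText() { return coveredText; }
    int getBegin() { return begin; }
}

class TypedComparatorDedupe {
    // Keeps one element per distinct covered text.
    static List<TextSpan> dedupe(List<TextSpan> input) {
        // Comparator<TextSpan>, not raw Comparator: the type argument is what
        // makes compare(TextSpan, TextSpan) a real override.
        Set<TextSpan> set = new TreeSet<TextSpan>(new Comparator<TextSpan>() {
            @Override
            public int compare(TextSpan o1, TextSpan o2) {
                // compareTo compares string contents and returns a negative,
                // zero, or positive value, as the Comparator contract requires.
                return o1.getCoveredText().compareTo(o2.getCoveredText());
            }
        });
        set.addAll(input);
        return new ArrayList<TextSpan>(set);
    }

    public static void main(String[] args) {
        List<TextSpan> spans = new ArrayList<TextSpan>();
        spans.add(new TextSpan("bird", 0));
        spans.add(new TextSpan("cat", 6));
        spans.add(new TextSpan("bush", 11));
        spans.add(new TextSpan("cat", 17));
        System.out.println("deduped size: " + dedupe(spans).size());
    }
}
```

With a TreeSet the result comes back sorted by covered text, and the first
element added for each text is the one that is kept. This assumes
getCoveredText() never returns null.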






                                                                       
                                                                       
                                                                       
                                                                       
                                                                       
                                                                       
                                                                       
                                                                       






From:   Marshall Schor <[email protected]>
To:     [email protected]
Date:   11/18/2014 11:54 AM
Subject:        Re: can't remove duplicate Annotations with Java Set Collection



An even simpler approach:

Use a HashMap, where the key is the annotation.getCoveredText() and the
value is the annotation, instead of a HashSet.

replace this (in your original):

// push tempList into HashSet
HashSet<Annotation> hs = new HashSet<Annotation>();
hs.addAll(tempList);


with

// push tempList into HashMap
HashMap<String, Annotation> hm = new HashMap<String, Annotation>();
for (Annotation a : tempList) {
  hm.put(a.getCoveredText(), a);
}
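
To make the HashMap idea concrete, here is a minimal self-contained sketch.
FakeAnnotation is a hypothetical stand-in for Annotation so the code runs
without UIMA, and the dedupe helper name is also just for illustration. The
deduplicated annotations are what is left in the map's values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a UIMA Annotation.
class FakeAnnotation {
    private final String coveredText;
    private final int begin;

    FakeAnnotation(String coveredText, int begin) {
        this.coveredText = coveredText;
        this.begin = begin;
    }

    String getCoveredText() { return coveredText; }
    int getBegin() { return begin; }
}

class HashMapDedupe {
    // Returns one annotation per distinct covered text.
    static List<FakeAnnotation> dedupe(List<FakeAnnotation> tempList) {
        Map<String, FakeAnnotation> hm = new HashMap<String, FakeAnnotation>();
        for (FakeAnnotation a : tempList) {
            // put() replaces any earlier value stored under the same key,
            // so the last annotation seen for each text is the one retained.
            hm.put(a.getCoveredText(), a);
        }
        // HashMap does not promise any particular iteration order.
        return new ArrayList<FakeAnnotation>(hm.values());
    }

    public static void main(String[] args) {
        List<FakeAnnotation> list = new ArrayList<FakeAnnotation>();
        list.add(new FakeAnnotation("bird", 0));
        list.add(new FakeAnnotation("cat", 6));
        list.add(new FakeAnnotation("bush", 11));
        list.add(new FakeAnnotation("cat", 17));
        System.out.println("deduped size: " + dedupe(list).size());
    }
}
```

If the original order matters, a LinkedHashMap could be substituted for the
HashMap; it keeps keys in first-insertion order.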

-Marshall

On 11/18/2014 9:45 AM, Marshall Schor wrote:
> Eclipse pointed out a bug in my code, fix is below
> On 11/18/2014 9:37 AM, Marshall Schor wrote:
>> Hi Kameron,
>>
>> Based on this code snip, the two "cat" annotations you create are "different"
>> using the HashSet definition, because they correspond to two distinct UIMA
>> Annotations.  You could, for instance, update one of them, and not the other;
>> that is the sense in which they are distinct.  In the case below, the two "cat"
>> annotations would have different begin and end offsets.
>>
>> I'm guessing that your goal was to have one of the two cat annotations be
>> dropped.
>>
>> You could do that by using your hash set approach, if you defined equal to mean
>> that just the covered text of the annotation was equal.
>>
>> Here's one way to do this:  Create a "cover object" for your annotations, that
>> contains a reference to the annotation and defines equals and hashCode (you have
>> to define these together).  The easy way to do this is using Eclipse - define a
>> new class: e.g.
>>
>> public class MyAnnotationWithSpecialEquals {
>>   final public Annotation annotation;   // the covered annotation
>>
>>   public MyAnnotationWithSpecialEquals(Annotation annotation) {
>>     this.annotation = annotation;
>>   }
>> }
>>
>> and then use Eclipse to define the equals and hashCode:  go to Menu -> Source ->
>> Generate hashCode() and equals()
>> and have it generate one based on just "annotation".  This will not (yet) be
>> correct - it should add two methods like this:
>>
>>   @Override
>>   public int hashCode() {
>>     final int prime = 31;
>>     int result = 1;
>>     result = prime * result + ((annotation == null) ? 0 : annotation.hashCode());
>>     return result;
>>   }
>>
>>   @Override
>>   public boolean equals(Object obj) {
>>     if (this == obj)
>>       return true;
>>     if (obj == null)
>>       return false;
>>     if (getClass() != obj.getClass())
>>       return false;
>>     MyAnnotationWithSpecialEquals other = (MyAnnotationWithSpecialEquals) obj;
>             // buggy lines
>>     if (annotation == null) {
>>       if (other.annotation != null)
>>         return false;
>             //  replace above with
>       if (annotation == null && other.annotation != null)
>         return false;
>>     } else if (!annotation.equals(other.annotation))
>>       return false;
>>     return true;
>>   }
>>
>> Now, to get these to be the definitions you want, which depend only on the
>> covered text, modify these as follows:
>>
>> First, for hashCode, use only the string covered text:
>>
>>   @Override
>>   public int hashCode() {
>>     final int prime = 31;
>>     int result = 1;
>>     result = prime * result + ((annotation == null) ? 0 : annotation.getCoveredText().hashCode());
>>     return result;
>>   }
>>
>> and for equals: replace test for annotation being "equal" with
>> annotation.getCoveredText() being "equal",
>> with some additional edge case testing in case of nulls:
>>
>>   @Override
>>   public boolean equals(Object obj) {
>>     if (this == obj)
>>       return true;
>>     if (obj == null)
>>       return false;
>>     if (getClass() != obj.getClass())
>>       return false;
>>     MyAnnotationWithSpecialEquals other = (MyAnnotationWithSpecialEquals) obj;
>>     if (annotation == null || other.annotation == null)
>>       return annotation == other.annotation;  // equal only if both are null
>>     String coveredText = annotation.getCoveredText();
>>     if (coveredText == null)
>>       return other.annotation.getCoveredText() == null;  // handle special case if covered text is null
>>     // coveredText is not null
>>     return coveredText.equals(other.annotation.getCoveredText());
>>   }
>>
>> HTH.  -Marshall
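
For completeness, a sketch of how the cover object might be used for the
dedupe itself: wrap each annotation, add the wrappers to a hash set, then
unwrap. CoveredSpan is a hypothetical stand-in for Annotation so the code runs
without UIMA, and SpanWithTextEquals is an illustrative name that mirrors
MyAnnotationWithSpecialEquals above. It uses java.util.Objects (Java 7) for the
null handling, which is a shortcut and not what Eclipse generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a UIMA Annotation.
class CoveredSpan {
    private final String coveredText;
    private final int begin;

    CoveredSpan(String coveredText, int begin) {
        this.coveredText = coveredText;
        this.begin = begin;
    }

    String getCoveredText() { return coveredText; }
    int getBegin() { return begin; }
}

// Cover object: equality and hash code depend only on the covered text.
class SpanWithTextEquals {
    final CoveredSpan annotation;

    SpanWithTextEquals(CoveredSpan annotation) {
        this.annotation = annotation;
    }

    // Note: a null annotation and an annotation with null text both map to null here.
    private String text() {
        return annotation == null ? null : annotation.getCoveredText();
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(text()); // 0 when the text is null
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        return Objects.equals(text(), ((SpanWithTextEquals) obj).text());
    }
}

class CoverObjectDedupe {
    static List<CoveredSpan> dedupe(List<CoveredSpan> input) {
        // LinkedHashSet is a HashSet that also remembers insertion order.
        Set<SpanWithTextEquals> seen = new LinkedHashSet<SpanWithTextEquals>();
        for (CoveredSpan s : input) {
            seen.add(new SpanWithTextEquals(s)); // ignored if an equal wrapper is present
        }
        List<CoveredSpan> result = new ArrayList<CoveredSpan>();
        for (SpanWithTextEquals w : seen) {
            result.add(w.annotation);
        }
        return result;
    }
}
```

Because add() leaves the set unchanged when an equal element is already there,
this keeps the first annotation seen for each covered text.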
>>
>>
>> On 11/17/2014 4:49 PM, Kameron Cole wrote:
>>> Input text:
>>>
>>> ------------------------------
>>>
>>> bird, cat, bush, cat
>>>
>>> ----------------------------
>>>
>>> Create the Annotations:
>>>
>>> -------------------------------
>>> docText = aJCas.getDocumentText();
>>>
>>> int index = docText.indexOf("cat");
>>> while (index >= 0) {
>>>    int begin = index;
>>>    int end = begin+3;
>>>    Animal animal = new Animal(aJCas);
>>>    animal.setBegin(begin);
>>>    animal.setEnd(end);
>>>    animal.addToIndexes();
>>>
>>>    index = docText.indexOf("cat", index+1);
>>> }
>>>
>>> index = docText.indexOf("bird");
>>> while (index >= 0) {
>>>    int begin = index;
>>>    int end = begin+4;
>>>    Animal animal = new Animal(aJCas);
>>>    animal.setBegin(begin);
>>>    animal.setEnd(end);
>>>    animal.addToIndexes();
>>>
>>>    index = docText.indexOf("bird", index+1);
>>> }
>>>
>>> index = docText.indexOf("bush");
>>> while (index >= 0) {
>>>    int begin = index;
>>>    int end = begin+4;
>>>    Vegetable animal = new Vegetable(aJCas);
>>>    animal.setBegin(begin);
>>>    animal.setEnd(end);
>>>    animal.addToIndexes();
>>>
>>>    index = docText.indexOf("bush", index+1);
>>> }
>>> ------------------------------------------------------
>>>
>>> From: Marshall Schor <[email protected]>
>>> To: [email protected]
>>> Date: 11/17/2014 04:35 PM
>>> Subject: Re: can't remove duplicate Annotations with Java Set Collection
>>>
>>>
>>> Hi,
>>>
>>> Two Feature Structures are considered "equal" in the sense used by HashSet, if
>>> fs1.equals(fs2).   The definition of "equals" for feature structures is: they
>>> are equal if they refer to the same underlying CAS, and the same "spot" in
>>> the CAS Heap.
>>>
>>> How did you create the Annotations that you think are "equal" in the HashSet
>>> sense?
>>>
>>> Here's an example of two annotations which are "equal" in the UIMA sorted index
>>> sense, but unequal in the HashSet sense.
>>>
>>>    // create two instances of Annotation in myJCas, each with begin = 0 and end = 4
>>>    Annotation fs1 = new Annotation(myJCas, 0, 4);
>>>    Annotation fs2 = new Annotation(myJCas, 0, 4);
>>>
>>> These will be "equal" in the UIMA sense - the same kind of annotation, in the
>>> same CAS, with the same feature values, but will be two distinct feature
>>> structures, so HashSet will consider them to be unequal.
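
The HashSet side of this can be reproduced in plain Java. The sketch below is
only an analogy: PlainSpan is a hypothetical class, not a UIMA type, and it
relies on the default Object.equals(), which is true only for the same object.
Two instances with identical field values therefore both stay in the set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; it deliberately does not override equals() or hashCode().
class PlainSpan {
    final int begin;
    final int end;

    PlainSpan(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }
}

class IdentityEqualityDemo {
    // Adds two look-alike spans (and one of them twice) and reports the set size.
    static int distinctCount() {
        PlainSpan fs1 = new PlainSpan(0, 4);
        PlainSpan fs2 = new PlainSpan(0, 4);
        Set<PlainSpan> set = new HashSet<PlainSpan>();
        set.add(fs1);
        set.add(fs2); // same field values, different object: kept
        set.add(fs1); // same object again: ignored
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println("set size: " + distinctCount()); // prints "set size: 2"
    }
}
```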
>>>
>>> Could this be what is happening in your case?  Please respond so we can see if
>>> there's another straightforward solution that does what you're looking for.
>>>
>>> -Marshall
>>> on 11/17/2014 2:59 PM, Kameron Cole wrote:
>>>> Hello,
>>>>
>>>> I am trying to get rid of duplicates in the FSIndex.  I thought a very
>>>> clever way to do this would be to just push them into a Set Collection in
>>>> Java, which does not allow duplicates. This is very (very) standard Java:
>>>>
>>>> ArrayList al = new ArrayList();
>>>> // add elements to al, including duplicates
>>>> HashSet hs = new HashSet();
>>>> hs.addAll(al);
>>>> al.clear();
>>>> al.addAll(hs);
>>>>
>>>> This list will contain no duplicates.
>>>>
>>>> However, I am not getting this to work in my UIMA code:
>>>>
>>>>
>>>> AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex();
>>>>
>>>> System.out.println("Index size is: "+idx.size());
>>>>
>>>> ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx.size());
>>>>
>>>> FSIterator it  = idx.iterator();
>>>>
>>>> //load the Annotations into a temporary list.  includes duplicates
>>>>
>>>> while(it.hasNext())
>>>> {
>>>>
>>>> tempList.add((Annotation) it.next());
>>>>
>>>> }
>>>>
>>>> Iterator tempIt = tempList.iterator();
>>>>
>>>> // remove all Annotations from the index.  this works fine
>>>>
>>>> while(tempIt.hasNext()){
>>>> ((Annotation) tempIt.next()).removeFromIndexes(aJCas);
>>>> }
>>>>
>>>> // push tempList into HashSet
>>>>
>>>> HashSet<Annotation> hs = new HashSet<Annotation>();
>>>>
>>>> hs.addAll(tempList);
>>>>
>>>> // this should not allow duplicates
>>>>
>>>> System.out.println("HS length: "+hs.size()); // size should be less than the
>>>> // size of the FSIndex by the number of duplicates.  It is not. This is the
>>>> // main problem
>>>>
>>>> tempList.clear();
>>>>
>>>> tempList.addAll(hs);
>>>>
>>>> System.out.println("templist length: "+tempList.size());
>>>>
>>>>
>>>> Iterator<Annotation> it2 = tempList.iterator(); // this should now be the
>>>> // clean list
>>>>
>>>>
>>>> while(it2.hasNext()){
>>>> it2.next().addToIndexes(aJCas);
>>>> }
>
>
