Re: How can I do efficient FSIndex lookup?

2019-09-09 Thread Mario Juric
Thanks, Marshall.

This is definitely a possibility, although it means a tighter coupling of our 
upstream components with the downstream parts in the pipeline that produce the 
different outputs depending on configuration, so we need to think about how to 
do this properly.

Cheers,
Mario













> On 9 Sep 2019, at 17:51 , Marshall Schor  wrote:
> 
> ok.
> 
> Re: getting the top 5 or 10 items: Here's a technique you may find of use:
> 
> Put the items into a Java PriorityQueue.  Keep a piece of data which is the
> bottom item, and in your insert-into-the-queue code, check if the item
> to-be-inserted is below that, and if so, skip it.
> 
> This gives a very efficient way to get the top 5 or 10 items.
> 
> HTH. -Marshall
> 
> On 9/9/2019 4:09 AM, Mario Juric wrote:
>> Hi,
>> 
>> Once again thanks for the response. It is really appreciated :)
>> 
>> I tried the moveTo(fs) instead of just using an iterator constructed from 
>> the FS, and this appeared to give me all items of the specified type when I 
>> didn’t set any values on it, which was an accidental experiment, but when I 
>> set the key property to what I was searching for then I got zero items back. 
>> Not sure what I might be doing wrong here, but I have learned something 
>> maybe more importantly to our use case in the mean time: The cost of 
>> indexing exceeds by far the benefits of any expected lookup speed in our 
>> case.
>> 
>> We are annotating a number of items with a lot of extracted feature 
>> information, and the hope was to be able to quickly get top 5 or 10 or 
>> whatever of the items with this or that key, which is why it was sorted by 
>> key first in natural sort order and then by the value in reverse order, 
>> meaning higher value is better, so that we could quickly get to the first 
>> item with the right key and then start pulling the top most items until we 
>> have those that we need.
>> 
>> So even if I could get this to work optimally it would in our case not be 
>> beneficial given the cost of indexing. It seems we really need many of those 
>> queries before it pays of, since the amount of feature information is much 
>> larger than the items they are associated with, so I reached to the 
>> preliminary conclusion to not have features in any index at all and just 
>> using plain FS record structures instead. It appears in our case much 
>> cheaper to run through all target items, which there are comparatively less 
>> of, to find what we need than to index all associated features and find the 
>> relevant target items through feature look up.
>> 
>> Cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 6 Sep 2019, at 16:50 , Marshall Schor  wrote:
>>> 
>>> Please don't add to the indexes, the FS you're temporarily using as the 
>>> argument
>>> for the moveTo operation.  (and of course, if you don't add it, you won't 
>>> need
>>> to remove it...)
>>> 
>>> If you describe your use case in a bit more detail, I can perhaps comment on
>>> this more.
>>> 
>>> -Marshall
>>> 
>>> On 9/6/2019 2:50 AM, Mario Juric wrote:
 Hi,
 
 Thanks for responding.
 
 I tried with a temporary FS where the key value was set, but I got every 
 annotation from the index, so that didn’t appear to change anything, and 
 it also broke my unit tests immediately. I also  stepped through the 
 iterator implementation and found construction of the iterator quite a bit 
 complex with an FS, so that went over my head without spending time to get 
 a deeper understanding of the underlying index implementation. Therefore I 
 tried with an indexed FS and this seemed to return the correct items, but 
 it would be awkward having to add some FS to the index in order to 
 retrieve something else and then having to remove the FS from the index 
 again. I am now also in doubt about the insertion costs, but I haven’t 
 measured that yet.
 
 I am not sure how many use custom FSIndex, but currently the API doesn’t 
 really support very well the type of use cases that we are working with, 
 so this is a disappointment for us. Does UIMA 3 improve on this? We are 
 still on 2.x since we are awaiting the next major DKPro release with UIMA 
 3 because of dependencies.
 
 Thanks a lot and cheers,
 Mario
 
 
 
 
 
 
 
 
 
 
 
 
> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho  
> wrote:
> 
> On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
>> The normal way to get the "binary search" kind of behavior is to get a 
>> plain
>> iterator over the sorted index, and then use the moveTo method, 
>> specifying a
>> target FS as the one to move to.  The target FS can be a "temporary" FS, 
>> one
>> that is never added to the indexes, itself; it is just used to supply 
>> values
>> used in the comparison.
> Is there a way to do this using a "temporary" 

Re: How can I do efficient FSIndex lookup?

2019-09-09 Thread Marshall Schor
ok.

Re: getting the top 5 or 10 items: Here's a technique you may find of use:

Put the items into a Java PriorityQueue.  Keep a piece of data which is the
bottom item, and in your insert-into-the-queue code, check if the item
to-be-inserted is below that, and if so, skip it.

This gives a very efficient way to get the top 5 or 10 items.

HTH. -Marshall

On 9/9/2019 4:09 AM, Mario Juric wrote:
> Hi,
>
> Once again thanks for the response. It is really appreciated :)
>
> I tried the moveTo(fs) instead of just using an iterator constructed from the 
> FS, and this appeared to give me all items of the specified type when I 
> didn’t set any values on it, which was an accidental experiment, but when I 
> set the key property to what I was searching for then I got zero items back. 
> Not sure what I might be doing wrong here, but I have learned something maybe 
> more importantly to our use case in the mean time: The cost of indexing 
> exceeds by far the benefits of any expected lookup speed in our case.
>
> We are annotating a number of items with a lot of extracted feature 
> information, and the hope was to be able to quickly get top 5 or 10 or 
> whatever of the items with this or that key, which is why it was sorted by 
> key first in natural sort order and then by the value in reverse order, 
> meaning higher value is better, so that we could quickly get to the first 
> item with the right key and then start pulling the top most items until we 
> have those that we need.
>
> So even if I could get this to work optimally it would in our case not be 
> beneficial given the cost of indexing. It seems we really need many of those 
> queries before it pays of, since the amount of feature information is much 
> larger than the items they are associated with, so I reached to the 
> preliminary conclusion to not have features in any index at all and just 
> using plain FS record structures instead. It appears in our case much cheaper 
> to run through all target items, which there are comparatively less of, to 
> find what we need than to index all associated features and find the relevant 
> target items through feature look up.
>
> Cheers,
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>> On 6 Sep 2019, at 16:50 , Marshall Schor  wrote:
>>
>> Please don't add to the indexes, the FS you're temporarily using as the 
>> argument
>> for the moveTo operation.  (and of course, if you don't add it, you won't 
>> need
>> to remove it...)
>>
>> If you describe your use case in a bit more detail, I can perhaps comment on
>> this more.
>>
>> -Marshall
>>
>> On 9/6/2019 2:50 AM, Mario Juric wrote:
>>> Hi,
>>>
>>> Thanks for responding.
>>>
>>> I tried with a temporary FS where the key value was set, but I got every 
>>> annotation from the index, so that didn’t appear to change anything, and it 
>>> also broke my unit tests immediately. I also  stepped through the iterator 
>>> implementation and found construction of the iterator quite a bit complex 
>>> with an FS, so that went over my head without spending time to get a deeper 
>>> understanding of the underlying index implementation. Therefore I tried 
>>> with an indexed FS and this seemed to return the correct items, but it 
>>> would be awkward having to add some FS to the index in order to retrieve 
>>> something else and then having to remove the FS from the index again. I am 
>>> now also in doubt about the insertion costs, but I haven’t measured that 
>>> yet.
>>>
>>> I am not sure how many use custom FSIndex, but currently the API doesn’t 
>>> really support very well the type of use cases that we are working with, so 
>>> this is a disappointment for us. Does UIMA 3 improve on this? We are still 
>>> on 2.x since we are awaiting the next major DKPro release with UIMA 3 
>>> because of dependencies.
>>>
>>> Thanks a lot and cheers,
>>> Mario
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
 On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho  
 wrote:

 On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
> The normal way to get the "binary search" kind of behavior is to get a 
> plain
> iterator over the sorted index, and then use the moveTo method, 
> specifying a
> target FS as the one to move to.  The target FS can be a "temporary" FS, 
> one
> that is never added to the indexes, itself; it is just used to supply 
> values
> used in the comparison.
 Is there a way to do this using a "temporary" FS which does not take up 
 CAS heap
 space in UIMAv2?

 -- Richard
>


Re: How can I do efficient FSIndex lookup?

2019-09-09 Thread Mario Juric
Great!

Looking forward to try it out as soon we have the time. We will need a testing 
period before we put this into production, but the first step is to get it to 
build and work end to end.

Cheers
Mario













> On 8 Sep 2019, at 19:40 , Richard Eckart de Castilho  wrote:
> 
> Hi Mario,
> 
>> On 6. Sep 2019, at 08:50, Mario Juric  wrote:
>> 
>> We are still on 2.x since we are awaiting the next major DKPro release with 
>> UIMA 3 because of dependencies.
> 
> DKPro Core 2.0.0 is now available :)
> 
> Cheers,
> 
> -- Richard



Re: How can I do efficient FSIndex lookup?

2019-09-09 Thread Mario Juric
Hi,

Once again thanks for the response. It is really appreciated :)

I tried the moveTo(fs) instead of just using an iterator constructed from the 
FS, and this appeared to give me all items of the specified type when I didn’t 
set any values on it, which was an accidental experiment, but when I set the 
key property to what I was searching for then I got zero items back. Not sure 
what I might be doing wrong here, but I have learned something maybe more 
importantly to our use case in the mean time: The cost of indexing exceeds by 
far the benefits of any expected lookup speed in our case.

We are annotating a number of items with a lot of extracted feature 
information, and the hope was to be able to quickly get top 5 or 10 or whatever 
of the items with this or that key, which is why it was sorted by key first in 
natural sort order and then by the value in reverse order, meaning higher value 
is better, so that we could quickly get to the first item with the right key 
and then start pulling the top most items until we have those that we need.

So even if I could get this to work optimally it would in our case not be 
beneficial given the cost of indexing. It seems we really need many of those 
queries before it pays of, since the amount of feature information is much 
larger than the items they are associated with, so I reached to the preliminary 
conclusion to not have features in any index at all and just using plain FS 
record structures instead. It appears in our case much cheaper to run through 
all target items, which there are comparatively less of, to find what we need 
than to index all associated features and find the relevant target items 
through feature look up.

Cheers,
Mario












> On 6 Sep 2019, at 16:50 , Marshall Schor  wrote:
> 
> Please don't add to the indexes, the FS you're temporarily using as the 
> argument
> for the moveTo operation.  (and of course, if you don't add it, you won't need
> to remove it...)
> 
> If you describe your use case in a bit more detail, I can perhaps comment on
> this more.
> 
> -Marshall
> 
> On 9/6/2019 2:50 AM, Mario Juric wrote:
>> Hi,
>> 
>> Thanks for responding.
>> 
>> I tried with a temporary FS where the key value was set, but I got every 
>> annotation from the index, so that didn’t appear to change anything, and it 
>> also broke my unit tests immediately. I also  stepped through the iterator 
>> implementation and found construction of the iterator quite a bit complex 
>> with an FS, so that went over my head without spending time to get a deeper 
>> understanding of the underlying index implementation. Therefore I tried with 
>> an indexed FS and this seemed to return the correct items, but it would be 
>> awkward having to add some FS to the index in order to retrieve something 
>> else and then having to remove the FS from the index again. I am now also in 
>> doubt about the insertion costs, but I haven’t measured that yet.
>> 
>> I am not sure how many use custom FSIndex, but currently the API doesn’t 
>> really support very well the type of use cases that we are working with, so 
>> this is a disappointment for us. Does UIMA 3 improve on this? We are still 
>> on 2.x since we are awaiting the next major DKPro release with UIMA 3 
>> because of dependencies.
>> 
>> Thanks a lot and cheers,
>> Mario
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 5 Sep 2019, at 23:42 , Richard Eckart de Castilho  
>>> wrote:
>>> 
>>> On 5. Sep 2019, at 23:40, Marshall Schor  wrote:
 The normal way to get the "binary search" kind of behavior is to get a 
 plain
 iterator over the sorted index, and then use the moveTo method, specifying 
 a
 target FS as the one to move to.  The target FS can be a "temporary" FS, 
 one
 that is never added to the indexes, itself; it is just used to supply 
 values
 used in the comparison.
>>> Is there a way to do this using a "temporary" FS which does not take up CAS 
>>> heap
>>> space in UIMAv2?
>>> 
>>> -- Richard
>>