On 28.08.2013 18:17, Alexandre Patry wrote:
> On 2013-08-28 11:25, Peter Klügl wrote:
>> On 28.08.2013 16:52, Alexandre Patry wrote:
>>> Hi,
>>>
>>> I use Ruta and I want to delete an annotation if it is within the
>>> first 50 tokens of a document. I came up with the following rules:
>>>
>>> // Annotate the first token in the document
>>> ANY{POSITION(Document, 1)-> Header};
>>> // Append the 49 following tokens
>>> Header{->SHIFT(Header, 1, 2)} ANY[0,49];
>>> // Delete the first ToDelete if it is within the header
>>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
>>>
>>>
>>> These rules work as expected but they are *really* slow. Is there a
>>> faster way to achieve that?
>>>
>> Oh yes, the first rule is really slow. I have always been missing a
>> MARKFIRST action (there is already a MARKLAST). I will add it today or
>> tomorrow.
>>
>> There are two reasons why the first rule is slow:
>> ANY has to look at all tokens and POSITION is just the slowest condition
>> in Ruta.
>> For now you could use a rule like:
>> ANY{STARTSWITH(Document)-> Header};
>> ... which avoids at least the POSITION condition.
>>
>> A simple test with a 200 W document:
>>
>> ...
>> ANY{POSITION(Document, 1)-> Header}; // [0.274s|93.52%]
>> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.090s|3.07%]
>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.030s|1.02%]
>>
>> ...
>> ANY{STARTSWITH(Document)-> Header}; // [0.047s|50.00%]
>> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.029s|30.85%]
>> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.011s|11.7%]
>>
>> Well, that's still slow (in debug mode), and I actually wonder why the
>> other rules are getting faster... but I hope that the performance will
>> soon be improved :-)
> Just tried it and it is much better, thanks!
>
> Many of my documents start with whitespace, so I had to update the rules to:
>
> Document{-> ADDRETAINTYPE(SPACE, BREAK)};
> ANY{STARTSWITH(Document) -> Header};
> // if the first token is a space, use the first non-space following it
> Header{IS({SPACE, BREAK}) -> UNMARK(Header)} ANY*?
> ANY{-PARTOF({SPACE, BREAK}) -> MARK(Header)};
> Document{-> REMOVERETAINTYPE(SPACE, BREAK)};
>
> Header{->SHIFT(Header, 1, 2)} ANY[0,49];
> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
>
> I will be happy to test drive MARKFIRST once it is in trunk.
>
It's already in the trunk. If you want, I can also think of
something that avoids the visibility problem.
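
In case it helps, here is a rough, untested sketch of how the header rules
might look with MARKFIRST. It assumes that the action takes the type to
create as its only argument and annotates the first token of the matched
annotation, and that Header and ToDelete are declared as in your script:

```ruta
// Assumption: MARKFIRST(Type) annotates the first token of the matched Document
Document{-> MARKFIRST(Header)};
// Unchanged from your script: extend Header over the following tokens
Header{->SHIFT(Header, 1, 2)} ANY[0,49];
// Unchanged from your script: drop the first ToDelete inside the header
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};
```

This only replaces the first rule, which avoids matching ANY against every
token. How it behaves for documents that begin with whitespace is not
covered by this sketch.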
Best,
Peter
> Alexandre
>