On 28.08.2013 16:52, Alexandre Patry wrote:
> Hi,
>
> I use RUTA and I want to delete an annotation if it is within the
> first 50 tokens of a document. I came up with the following rules :
>
> ANY{POSITION(Document, 1)-> Header}; // Annotate the first token in the document
> Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // Appends the 49 following tokens
> ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // Delete the first ToDelete if it is within the header
>
>
> These rules work as expected but they are *really* slow. Is there a
> faster way to achieve that?
>
Oh yes, the first rule is really slow. I have always missed a MARKFIRST
action (there is already a MARKLAST). I will add it today or tomorrow.
There are two reasons why the first rule is slow:
ANY has to look at every token, and POSITION is simply the slowest
condition in Ruta.
For now you could use a rule like:
ANY{STARTSWITH(Document)-> Header};
... which at least avoids the POSITION condition.
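
To make the two reasons above a bit more concrete, here is a toy model in
Python. It is only a sketch and not how Ruta is implemented: the helper
names (tokenize, first_by_position, first_by_start) are hypothetical, and
the assumption that a position-style check recounts the tokens for every
candidate is mine, for illustration only. Both variants still visit every
token, as ANY does, but the start-offset check does constant work per token.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets


def tokenize(text: str) -> List[Span]:
    """Split on whitespace and return (begin, end) offsets for each token."""
    spans, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        spans.append((begin, begin + len(word)))
        pos = begin + len(word)
    return spans


def first_by_position(tokens: List[Span], doc: Span, index: int = 1) -> List[Span]:
    """Position-style check (assumed model): for every candidate token,
    recollect the tokens inside doc and compare the candidate's rank."""
    matches = []
    for tok in tokens:
        inside = [t for t in tokens if doc[0] <= t[0] and t[1] <= doc[1]]
        if tok in inside and inside.index(tok) + 1 == index:
            matches.append(tok)
    return matches


def first_by_start(tokens: List[Span], doc: Span) -> List[Span]:
    """Start-style check: a token matches if it begins where doc begins."""
    return [tok for tok in tokens if tok[0] == doc[0]]


if __name__ == "__main__":
    text = " ".join(f"w{i}" for i in range(200))
    tokens = tokenize(text)
    doc = (0, len(text))
    # Both checks select the same first token; only the work per token differs.
    print(first_by_position(tokens, doc) == first_by_start(tokens, doc))
```

In this model the first function does work proportional to the number of
tokens for each candidate, while the second does a single comparison.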
A simple test with a 200 W document:
...
ANY{POSITION(Document, 1)-> Header}; // [0.274s|93.52%]
Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.090s|3.07%]
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.030s|1.02%]
...
ANY{STARTSWITH(Document)-> Header}; // [0.047s|50.00%]
Header{->SHIFT(Header, 1, 2)} ANY[0,49]; // [0.029s|30.85%]
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.011s|11.7%]
Well, that's still slow (in debug mode), and I actually wonder why the
other rules got faster as well... but I hope that the performance will
be improved soon :-)
Best,
Peter
> Thanks,
>
> Alexandre
>