On 2013-08-28 11:25, Peter Klügl wrote:
On 28.08.2013 16:52, Alexandre Patry wrote:
Hi,

I use RUTA and I want to delete an annotation if it is within the
first 50 tokens of a document. I came up with the following rules:

    // Annotate the first token in the document
    ANY{POSITION(Document, 1)-> Header};
    // Append the 49 following tokens
    Header{->SHIFT(Header, 1, 2)} ANY[0,49];
    // Delete the first ToDelete if it is within the header
    ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};


These rules work as expected but they are *really* slow. Is there a
faster way to achieve that?

Oh yes, the first rule is really slow. I have long wanted a MARKFIRST action
(there is already a MARKLAST). I will add it today or tomorrow.

There are two reasons why the first rule is slow:
ANY has to look at all tokens and POSITION is just the slowest condition
in Ruta.
For now you could use a rule like:
ANY{STARTSWITH(Document)-> Header};
... which avoids at least the POSITION condition.

A simple test with a 200 W document:

...
ANY{POSITION(Document, 1)-> Header}; // [0.274s|93.52%]
Header{->SHIFT(Header, 1, 2)} ANY[0,49];  // [0.090s|3.07%]
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.030s|1.02%]

...
ANY{STARTSWITH(Document)-> Header};  // [0.047s|50.00%]
Header{->SHIFT(Header, 1, 2)} ANY[0,49];  // [0.029s|30.85%]
ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)}; // [0.011s|11.7%]

Well, that's still slow (in debug mode) and I actually wonder why the
other rules are getting faster... but I hope that the performance will
soon be improved :-)

Just tried it and it is much better, thanks!

Many of my documents start with whitespace, so I had to update the rules to:

   Document{-> ADDRETAINTYPE(SPACE, BREAK)};
   ANY{STARTSWITH(Document) -> Header};
   // if the first token is a space, use the first non-space following it
   Header{IS({SPACE, BREAK}) -> UNMARK(Header)} ANY*?
   ANY{-PARTOF({SPACE, BREAK}) -> MARK(Header)};
   Document{-> REMOVERETAINTYPE(SPACE, BREAK)};

   Header{->SHIFT(Header, 1, 2)} ANY[0,49];
   ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};

I will be happy to test drive MARKFIRST once it is in trunk.
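
For reference, a rough sketch of how the rules might look once MARKFIRST
exists. It assumes MARKFIRST takes a single type argument and marks the
first token of the matched annotation, mirroring MARKLAST; that signature
is an assumption and is not confirmed anywhere in this thread:

    // Assumption: MARKFIRST(Type) marks the first token of the match
    Document{-> MARKFIRST(Header)};
    Header{->SHIFT(Header, 1, 2)} ANY[0,49];
    ToDelete{POSITION(Header, 1) -> UNMARK(ToDelete)};

If it behaves that way, the first rule would no longer need to visit every
ANY token, which is what made the original version slow.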

Alexandre

--
Alexandre Patry, Ph.D
Chercheur / Researcher
http://KeaText.com

Transformez vos documents en outils de décision
<< Turn your documents into decision tools
