Hi,

I extended the pattern to support dashes, but not the other quotes. This can get arbitrarily complex (and slow) if every combination of Unicode characters that look like quotes should be supported. I still think that this is not valid XML. Can you give me a link to the standard?
It may be better to solve this for the specific use case before applying the seeder.

Best,

Peter

On 20.10.2015 at 19:22, Mario Gazzo wrote:
> I believe it should be extended since I think that a RUTA user would expect
> that the MARKUP annotation indeed captures at least XML and HTML markup
> properly. The examples are from a PubMed Central XML file that follows the
> NISO JATS specification, so I will assume it is properly formatted XML without
> knowing all the details of the spec.
>
> We have managed to implement a crude workaround for now, but let us know when
> an improved version becomes available.
>
> Cheers
> Mario
>
>> On 20 Oct 2015, at 17:56, Peter Klügl wrote:
>>
>> Hi Mario,
>>
>> yes, and the different quote also causes problems (are these valid?).
>>
>> The MARKUP annotation is not created by JFlex like the other annotations,
>> but by a postprocessing step using a regular expression. This expression
>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>
>> Should we extend it?
>>
>> Best,
>>
>> Peter
>>
>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>> Hi Peter,
>>>
>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are some
>>> examples:
>>>
>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>> <sec sec-type="methods”>
>>>
>>> The above markup examples are totally missing in the TokenSeed annotations.
>>> I wonder whether it is related to the dash in the attribute names, since
>>> other markup without this appears to be captured.
>>>
>>> Can you confirm that the dash could cause the problem?
>>>
>>> Cheers
>>> Mario
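
For illustration, here is a minimal Java sketch of the two ideas discussed in this thread: a tag pattern whose element and attribute names may contain dashes, and a quote normalization step that a specific use case could run before the seeder. This is not the actual markupPattern from DefaultSeeder.java. The class name MarkupPatternSketch, its methods, and the regular expression itself are hypothetical and only assume straight single or double quotes as attribute delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UIMA Ruta.
final class MarkupPatternSketch {

    // Simplified XML-like name: starts with a letter, underscore or colon,
    // then may also contain digits, dots and dashes (e.g. "ref-type").
    private static final String NAME = "[A-Za-z_:][A-Za-z0-9_:.\\-]*";

    // Attribute value delimited by straight double or straight single quotes only.
    private static final String VALUE = "(?:\"[^\"<]*\"|'[^'<]*')";

    // Opening, closing or self-closing tag with zero or more attributes.
    static final Pattern MARKUP = Pattern.compile(
            "</?" + NAME + "(?:\\s+" + NAME + "\\s*=\\s*" + VALUE + ")*\\s*/?>");

    // Replaces typographic quotes (U+201C, U+201D, U+2018, U+2019) with straight
    // ones. Each replacement is one char for one char, so offsets are preserved.
    // Assumption: it is acceptable to rewrite these quotes in the whole text.
    static String normalizeQuotes(String text) {
        return text.replace('\u201C', '"')
                   .replace('\u201D', '"')
                   .replace('\u2018', '\'')
                   .replace('\u2019', '\'');
    }

    // Returns every substring of the text that the sketch pattern accepts as markup.
    static List<String> findMarkup(String text) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = MARKUP.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }
}
```

With this sketch, a tag such as the xref example above is not matched while its closing quote is typographic, and is matched after normalizeQuotes has been applied. Keeping the normalization outside the pattern avoids enumerating quote look-alikes inside the regular expression, which is the complexity concern raised at the top of this message.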
