Hi Manuel,

oh yes, I forgot about the element name. Thank you for the patch, I will integrate it. The usual procedure would be to attach the patch to a Jira issue. I will take care of it, but you are of course also welcome to attach it :-)
Best,
Peter

On 22.10.2015 at 12:43, Manuel Ciosici wrote:
> Hello Peter,
>
> I looked a bit at the new regular expression and there are still some cases that aren’t caught. More specifically, it won’t annotate XML tags that have a dash in their name, so tags such as:
>
> <first-name>
>
> aren’t caught by the current regular expression. I’ve changed the expression so that it works. What I did was change the \w+ part of the tag name into \w[\w-]* since XML tag names can contain dashes, but cannot start with dashes. I’ve also updated the unit test so that there are tags with dashes and underscores and also one non-tag.
>
> I’m attaching the SVN patch to this email.
>
> Manuel
>
>> Thanks Peter,
>>
>> The quotes are just normal quotes in the original source, but the mail software must have changed this. Sorry about that misunderstanding.
>>
>> Cheers
>> Mario
>>
>>> On 21/10/2015, at 16.03, Peter Klügl <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I extended the pattern to support dashes, but not the other quotes. This can get arbitrarily complex (and slow) if any combination of Unicode characters that look like quotes should be supported. I still think that this is not valid XML. Can you give me a link to the standard?
>>>
>>> It's maybe better to solve this in a specific use case before applying the seeder.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>>>> I believe it should be extended since I think that a RUTA user would expect that the MARKUP annotation indeed captures at least XML and HTML markup properly. The examples are from a PubMed Central XML file that follows the NISO JATS specification, so I will assume it is properly formatted XML without knowing all the details of the spec.
>>>>
>>>> We have managed to implement a crude workaround for now, but let us know when an improved version becomes available.
>>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 20 Oct 2015, at 17:56, Peter Klügl <[email protected]> wrote:
>>>>>
>>>>> Hi Mario,
>>>>>
>>>>> yes, and the different quote also causes problems (are these valid?).
>>>>>
>>>>> The MARKUP annotation is not created by JFlex like the other annotations, but by a postprocessing step using a regular expression. This expression does not cover these cases (markupPattern in DefaultSeeder.java).
>>>>>
>>>>> Should we extend it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are some examples:
>>>>>>
>>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>>>>> <sec sec-type="methods”>
>>>>>>
>>>>>> The above markup examples are totally missing in the TokenSeed annotations. I wonder whether it is related to the dash in the attribute names, since other markup without this appears to be captured.
>>>>>>
>>>>>> Can you confirm that the dash could cause the problem?
>>>>>>
>>>>>> Cheers
>>>>>> Mario
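
For reference, the tag-name change discussed in this thread (from \w+ to \w[\w-]*) can be illustrated with a minimal Java sketch. The surrounding tag pattern, the class name MarkupPatternSketch, and the helpers tagPattern and findTags are assumptions made up for this illustration. The actual markupPattern in DefaultSeeder.java is not quoted in the thread and may well differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: these patterns are simplified stand-ins,
// not the actual markupPattern from DefaultSeeder.java.
class MarkupPatternSketch {

  // Tag name before the change described in the thread: word characters only.
  static final String OLD_NAME = "\\w+";

  // Tag name after the change: one word character, then word characters
  // or dashes, so a name may contain a dash but cannot start with one.
  static final String NEW_NAME = "\\w[\\w-]*";

  // Hypothetical helper: wraps a tag-name expression in a simplified
  // pattern for opening and closing tags with optional attributes.
  static Pattern tagPattern(String name) {
    return Pattern.compile("</?" + name + "(?:\\s[^<>]*)?>");
  }

  // Returns every substring of text that the given pattern matches.
  static List<String> findTags(Pattern pattern, String text) {
    List<String> tags = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      tags.add(matcher.group());
    }
    return tags;
  }

  public static void main(String[] args) {
    String text = "<first-name>Ada</first-name> <last_name>L</last_name> <-oops>";
    System.out.println(findTags(tagPattern(OLD_NAME), text));
    System.out.println(findTags(tagPattern(NEW_NAME), text));
  }
}
```

With the sample text in main, the old tag-name expression matches only <last_name> and </last_name>, while the new one also matches <first-name> and </first-name>. The non-tag <-oops> is matched by neither, because the name may not start with a dash.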
