Re: validate an RDF/XML file with bad URL's

Jean-Marc Vanel Sat, 20 Jan 2018 02:41:23 -0800

Thanks, Conal!

At first, your template produced this error:


  XTDE1140: Error in regular expression: net.sf.saxon.trans.XPathException:
  The regular expression must not be one that matches a zero-length string

so I removed the ? after the first group (the http scheme and host):
    <xsl:analyze-string select="." regex="(https?://[^\?/]*)([^?#]*)(.*)">
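
A minimal sketch, in Python rather than XSLT, of why the first regex is rejected. Both patterns are copied from this thread; the helper name matches_empty is hypothetical. Python's regex engine is not Saxon's, so this only illustrates the zero-length-match rule and does not reproduce XTDE1140:

```python
import re

# Regexes copied from the thread; only the "?" after the first group differs.
ORIGINAL = r"(https?://[^\?/]*)?([^?#]*)(.*)"
MODIFIED = r"(https?://[^\?/]*)([^?#]*)(.*)"


def matches_empty(pattern: str) -> bool:
    """Return True if the pattern can match a zero-length string."""
    return re.fullmatch(pattern, "") is not None


print(matches_empty(ORIGINAL))  # True: every group is optional or starred
print(matches_empty(MODIFIED))  # False: "http" is now required
```

One likely side effect of the change: the modified pattern no longer matches values that do not start with http:// or https://, such as relative references.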

As I understand it, your template is a valuable example of decomposing
a URL (or URI, or IRI?).
It emits no message, and

   - it leaves the scheme and host unchanged,
   - it fixes the URI encoding in the path,
   - it leaves the query and fragment id unchanged
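
The three steps listed above can be sketched in Python for illustration. This is not the XSLT template itself: repair_uri is a hypothetical name, urllib.parse.quote stands in for encode-for-uri(), and returning unmatched input unchanged is an assumption of this sketch.

```python
import re
from urllib.parse import quote

# Same three groups as the modified regex in this message:
# 1 = scheme and host, 2 = path, 3 = query and fragment id.
URI_PARTS = re.compile(r"(https?://[^\?/]*)([^?#]*)(.*)")
# Characters the template treats as OK in a URI path.
PATH_OK = re.compile(r"[/a-zA-Z0-9\-\._~]")


def repair_uri(uri: str) -> str:
    """Percent-encode disallowed path characters; keep the other parts as-is."""
    match = URI_PARTS.search(uri)
    if match is None:
        # Assumption of this sketch: return unmatched input unchanged.
        return uri
    scheme_host, path, query_fragment = match.groups()
    fixed_path = "".join(
        ch if PATH_OK.fullmatch(ch) else quote(ch, safe="")
        for ch in path
    )
    return scheme_host + fixed_path + query_fragment


print(repair_uri("http://example.org/a b/é?q=x y#frag"))
# http://example.org/a%20b/%C3%A9?q=x y#frag
```

Note that the space in the query string is left alone, consistent with the third bullet.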

I'm not sure whether it correctly analyses an IRI containing a special
character such as ç:
http://quartier-français-editions.re/
<http://xn--quartier-franais-editions-5gc.re/>
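
A quick check, sketched in Python rather than XSLT, suggests the regex splits this IRI as expected: the ç stays in group 1 and is passed through unchanged, since only the path is encoded. The function names split_iri and to_ascii_host are hypothetical, and Python's regex engine is not Saxon's, so the result is indicative only. The second function uses Python's built-in idna codec to show where the xn-- form comes from:

```python
import re

# Modified regex from this message.
URI_PARTS = re.compile(r"(https?://[^\?/]*)([^?#]*)(.*)")


def split_iri(iri: str) -> tuple:
    """Return (scheme and host, path, query and fragment id)."""
    return URI_PARTS.search(iri).groups()


def to_ascii_host(host: str) -> str:
    """Convert an internationalized host name to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")


parts = split_iri("http://quartier-français-editions.re/")
print(parts)
# ('http://quartier-français-editions.re', '/', '')

print(to_ascii_host("quartier-français-editions.re"))
# xn--quartier-franais-editions-5gc.re
```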

Thanks to Martynas too!

All of this is under version control :) here:
https://framagit.org/Scrutari/RDFexport


2018-01-20 10:28 GMT+01:00 Conal Tuohy <[email protected]>:

> On 20 January 2018 at 18:37, Jean-Marc Vanel <[email protected]>
> wrote:
>
> > 2018-01-20 0:15 GMT+01:00 Andy Seaborne <[email protected]>:
> >
> > > Hi,
> > >
> > > Minimal, example file?
> > >
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" >
> >     <foaf:Organization
> >         rdf:about="https://www.communecter.org/#organization.detail.id.5898612440bb4e7d28cfc81a">
> >       <foaf:homepage rdf:resource="*http://anais-
> [email protected]
> > <http://[email protected]>*"/>
> >   </foaf:Organization>
> > </rdf:RDF>
> >
>
>
> >
> > > Passing the input through a text processing stage (perl, sed ...) is
> > > probably the better way - fix up the errors.
> > >
> >
> > Sure, but I'm at the end of the data flow: a crowdsourcing site gathers
> > data of variable quality, then a developer converts several such sites into
> > a single XML format, then I apply XSLT to produce RDF. So the data is
> > curated upstream, and I report everything I find. And bad IRIs do not
> > prevent the RDF from being loaded into TDB.
> >
> >
> If you are generating the RDF/XML using XSLT, may I suggest you try
> to clean up the URIs in the XSLT? If you are using XSLT version 2 or newer,
> then you can even use xsl:analyze-string to check URIs with a regex, but
> even in XSLT 1 it should not be hard. Then you can repair (or log) errors
> like the one in your example, as well as ensuring that host names are in
> lower case, characters are correctly URI-encoded, etc.
>
> For example, here's an XSLT template I've used to repair incorrect
> URI-encoding in some URIs prior to ingestion as RDF:
>
> <xsl:template match="@href">
>   <xsl:attribute name="href">
>     <xsl:analyze-string select="." regex="(https?://[^\?/]*)?([^?#]*)(.*)">
>       <xsl:matching-substring>
>         <!-- regex-group(1) = scheme and host -->
>         <!-- regex-group(2) = path -->
>         <!-- regex-group(3) = query and fragment id -->
>         <xsl:value-of select="regex-group(1)"/>
>         <xsl:analyze-string select="regex-group(2)" regex="[/a-zA-Z0-9\-\._~]">
>           <!-- matches any character OK in a URI path -->
>           <xsl:matching-substring>
>             <xsl:value-of select="."/>
>           </xsl:matching-substring>
>           <!-- characters that aren't OK get encoded -->
>           <xsl:non-matching-substring>
>             <xsl:value-of select="encode-for-uri(.)"/>
>           </xsl:non-matching-substring>
>         </xsl:analyze-string>
>         <xsl:value-of select="regex-group(3)"/>
>       </xsl:matching-substring>
>     </xsl:analyze-string>
>   </xsl:attribute>
> </xsl:template>
>
> I hope that's helpful!
>
> --
> Conal Tuohy
> http://conaltuohy.com/
> @conal_tuohy
> +61-466-324297
>



-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
<http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
