I agree that fixing at the source is the way to go.

Checking (but not fixing) a URI in XSLT 2.0 could be as simple as

@rdf:about castable as xs:anyURI
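
To show where that test might sit, here is a minimal sketch of a stylesheet that copies the input through unchanged and only logs values that fail the cast. The choice of attributes (rdf:about and rdf:resource) and the message text are my assumptions, not taken from your stylesheet. Bear in mind that the check is only as strict as the processor's xs:anyURI validation, which the specs allow to be quite lenient, so it may not catch every bad IRI.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <!-- Identity template: copy everything through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed targets: rdf:about and rdf:resource.
       Log any value that cannot be cast to xs:anyURI, then keep it as-is. -->
  <xsl:template match="@rdf:about[not(. castable as xs:anyURI)]
                     | @rdf:resource[not(. castable as xs:anyURI)]">
    <xsl:message>Suspect URI: <xsl:value-of select="."/></xsl:message>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The predicate gives the second template a higher default priority than the identity template, so it only takes over for the values that fail the check.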

On Sat, 20 Jan 2018 at 10.28, Conal Tuohy <[email protected]> wrote:

> On 20 January 2018 at 18:37, Jean-Marc Vanel <[email protected]>
> wrote:
>
> > 2018-01-20 0:15 GMT+01:00 Andy Seaborne <[email protected]>:
> >
> > > Hi,
> > >
> > > Minimal, example file?
> > >
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" >
> >     <foaf:Organization
> >         rdf:about="
> > https://www.communecter.org/#organization.detail.id.
> > 5898612440bb4e7d28cfc81a"
> > >
> >       <foaf:homepage rdf:resource="http://[email protected]"/>
> >   </foaf:Organization>
> > </rdf:RDF>
> >
>
>
> >
> > > Passing the input through a text processing stage (perl, sed ...) is
> > > probably the better way - fix up the errors.
> > >
> >
> > Sure, but I'm at the end of the data flow: a crowdsourcing site gathers
> > data of variable quality, then a developer converts several such sites
> > into a single XML format, then I apply XSLT to produce RDF. So upstream
> > it's curated, and I report everything I find. And bad IRIs do not
> > prevent the RDF from being loaded in TDB.
> >
> >
> If you are generating the RDF/XML using XSLT, may I suggest you try
> to clean up the URIs in the XSLT? If you are using XSLT version 2 or newer,
> then you can even use xsl:analyze-string to check URIs with a regex, but
> even in XSLT 1 it should not be hard. Then you can repair (or log) errors
> like the one in your example, as well as ensuring that host names are in
> lower case, characters are correctly URI-encoded, etc.
>
> For example, here's an XSLT template I've used to repair incorrect
> URI-encoding in some URIs prior to ingestion as RDF:
>
> <xsl:template match="@href">
>   <xsl:attribute name="href">
>     <xsl:analyze-string select="." regex="(https?://[^\?/]*)?([^?#]*)(.*)">
>       <xsl:matching-substring>
>         <!-- regex-group(1) = scheme and host -->
>         <!-- regex-group(2) = path -->
>         <!-- regex-group(3) = query and fragment id -->
>         <xsl:value-of select="regex-group(1)"/>
>         <xsl:analyze-string select="regex-group(2)" regex="[/a-zA-Z0-9\-\._~]">
>           <!-- matches any character OK in a URI path -->
>           <xsl:matching-substring>
>             <xsl:value-of select="."/>
>           </xsl:matching-substring>
>           <!-- characters that aren't OK get encoded -->
>           <xsl:non-matching-substring>
>             <xsl:value-of select="encode-for-uri(.)"/>
>           </xsl:non-matching-substring>
>         </xsl:analyze-string>
>         <xsl:value-of select="regex-group(3)"/>
>       </xsl:matching-substring>
>     </xsl:analyze-string>
>   </xsl:attribute>
> </xsl:template>
>
> I hope that's helpful!
>
> --
> Conal Tuohy
> http://conaltuohy.com/
> @conal_tuohy
> +61-466-324297
>
