I agree that fixing at the source is the way to go. Checking (but not fixing) a URI in XSLT 2.0 could be as simple as
@rdf:about castable as xs:anyURI

On Sat, 20 Jan 2018 at 10.28, Conal Tuohy <[email protected]> wrote:

> On 20 January 2018 at 18:37, Jean-Marc Vanel <[email protected]>
> wrote:
>
> > 2018-01-20 0:15 GMT+01:00 Andy Seaborne <[email protected]>:
> >
> > > Hi,
> > >
> > > Minimal, example file?
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/" >
> > <foaf:Organization
> > rdf:about="https://www.communecter.org/#organization.detail.id.5898612440bb4e7d28cfc81a">
> > <foaf:homepage rdf:resource="*http://[email protected] <http://[email protected]>*"/>
> > </foaf:Organization>
> > </rdf:RDF>
> >
> > > Passing the input through a text processing stage (perl, sed ...) is
> > > probably the better way - fix up the errors.
> >
> > Sure, but I'm at the end of data flow: a crowd sourcing site gathers
> > (variable) quality data, then a developer converts several such sites in a
> > unique XML format, then me applying XSLT for RDF. So upstream it's curated,
> > and I report everything I find. And bad IRI's do not prevent the RDF to be
> > loaded in TDB.
>
> If you are generating the RDF/XML using XSLT, may I suggest you try
> to clean up the URIs in the XSLT? If you are using XSLT version 2 or newer,
> then you can even use xsl:analyze-string to check URIs with a regex, but
> even in XSLT 1 it should not be hard. Then you can repair (or log) errors
> like the one in your example, as well as ensuring that host names are in
> lower case, characters are correctly URI-encoded, etc.
>
> For example, here's an XSLT template I've used to repair incorrect
> URI-encoding in some URIs prior to ingestion as RDF:
>
> <xsl:template match="@href">
>   <xsl:attribute name="href">
>     <xsl:analyze-string select="." regex="(https?://[^\?/]*)?([^?#]*)(.*)">
>       <xsl:matching-substring>
>         <!-- regex-group(1) = scheme and host -->
>         <!-- regex-group(2) = path -->
>         <!-- regex-group(3) = query and fragment id -->
>         <xsl:value-of select="regex-group(1)"/>
>         <xsl:analyze-string select="regex-group(2)" regex="[/a-zA-Z0-9\-\._~]">
>           <!-- matches any character OK in a URI path -->
>           <xsl:matching-substring>
>             <xsl:value-of select="."/>
>           </xsl:matching-substring>
>           <!-- characters that aren't OK get encoded -->
>           <xsl:non-matching-substring>
>             <xsl:value-of select="encode-for-uri(.)"/>
>           </xsl:non-matching-substring>
>         </xsl:analyze-string>
>         <xsl:value-of select="regex-group(3)"/>
>       </xsl:matching-substring>
>     </xsl:analyze-string>
>   </xsl:attribute>
> </xsl:template>
>
> I hope that's helpful!
>
> --
> Conal Tuohy
> http://conaltuohy.com/
> @conal_tuohy
> +61-466-324297
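
For what it's worth, the "castable as" check mentioned at the top could be wired into a stylesheet roughly as below. This is only a sketch, assuming an XSLT 2.0 processor; the identity template and the message text are illustrative and not taken from anyone's actual stylesheet. How strictly a string is validated when cast to xs:anyURI is implementation-dependent, so some malformed URIs may well pass this test.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    exclude-result-prefixes="xs">

  <!-- Identity template: copy everything through unchanged (illustrative) -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Report (but do not repair) rdf:about values the processor refuses to cast -->
  <xsl:template match="@rdf:about[not(. castable as xs:anyURI)]">
    <xsl:message>Suspicious URI: <xsl:value-of select="."/></xsl:message>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The same predicate could be applied to @rdf:resource as well; the attribute is still copied to the output, so the check only logs.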
