On 20 January 2018 at 18:37, Jean-Marc Vanel <[email protected]>
wrote:

> 2018-01-20 0:15 GMT+01:00 Andy Seaborne <[email protected]>:
>
> > Hi,
> >
> > Minimal, example file?
> >
>
> <?xml version="1.0" encoding="UTF-8"?>
> <rdf:RDF xmlns:foaf="http://xmlns.com/foaf/0.1/">
>     <foaf:Organization
>         rdf:about="https://www.communecter.org/#organization.detail.id.5898612440bb4e7d28cfc81a">
>       <foaf:homepage rdf:resource="http://[email protected]"/>
>   </foaf:Organization>
> </rdf:RDF>
>


>
> > Passing the input through a text processing stage (perl, sed ...) is
> > probably the better way - fix up the errors.
> >
>
> Sure, but I'm at the end of the data flow: a crowdsourcing site gathers
> data of variable quality, then a developer converts several such sites into
> a single XML format, then I apply XSLT to produce RDF. So it's curated
> upstream, and I report everything I find. And bad IRIs do not prevent the
> RDF from being loaded into TDB.
>
>
If you are generating the RDF/XML using XSLT, may I suggest you try
to clean up the URIs in the XSLT? If you are using XSLT version 2 or newer,
then you can even use xsl:analyze-string to check URIs with a regex, but
even in XSLT 1 it should not be hard. Then you can repair (or log) errors
like the one in your example, as well as ensuring that host names are in
lower case, characters are correctly URI-encoded, etc.

For example, here's an XSLT template I've used to repair incorrect
URI-encoding in some URIs prior to ingestion as RDF:

<xsl:template match="@href">
  <xsl:attribute name="href">
    <xsl:analyze-string select="." regex="(https?://[^\?/]*)?([^?#]*)(.*)">
      <xsl:matching-substring>
        <!-- regex-group(1) = scheme and host -->
        <!-- regex-group(2) = path -->
        <!-- regex-group(3) = query and fragment id -->
        <xsl:value-of select="regex-group(1)"/>
        <xsl:analyze-string select="regex-group(2)" regex="[/a-zA-Z0-9\-\._~]">
          <!-- matches any character OK in a URI path -->
          <xsl:matching-substring>
            <xsl:value-of select="."/>
          </xsl:matching-substring>
          <!-- characters that aren't OK get encoded -->
          <xsl:non-matching-substring>
            <xsl:value-of select="encode-for-uri(.)"/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
        <xsl:value-of select="regex-group(3)"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:attribute>
</xsl:template>
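
If you're stuck with XSLT 1, where xsl:analyze-string and encode-for-uri()
aren't available, a rough check is still possible with translate(). The
following is only an untested sketch, not something I've run against your
data: the variable name "uri-chars" and the choice of matching @href are
assumptions for illustration, and the allowed set is simply the unreserved
and reserved characters of RFC 3986 plus "%". It logs suspect values rather
than repairing them, and it only catches illegal characters (spaces, angle
brackets and so on), not structurally wrong URIs.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical helper: every character permitted somewhere in a URI
       (RFC 3986 unreserved + reserved characters, plus the percent sign) -->
  <xsl:variable name="uri-chars"
      select="concat('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
                     '0123456789-._~:/?#[]@!$&amp;()*+,;=%', &quot;'&quot;)"/>

  <xsl:template match="@href">
    <!-- Delete every permitted character; whatever remains is not allowed -->
    <xsl:variable name="bad" select="translate(., $uri-chars, '')"/>
    <xsl:if test="$bad != ''">
      <xsl:message>Suspect URI "<xsl:value-of select="."/>": unexpected character(s) "<xsl:value-of select="$bad"/>"</xsl:message>
    </xsl:if>
    <!-- Pass the attribute through unchanged; repair is left to a later step -->
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

Structural checks (for instance, a test with contains() or starts-with() on
the value) would need to be added on top of this for cases where every
character is legal but the URI as a whole is not what was intended.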

I hope that's helpful!

-- 
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
+61-466-324297
