Re: Getting rid of triples with bad URIs

james anderson Thu, 27 Oct 2016 13:06:25 -0700

good afternoon;

> On 2016-10-27, at 11:46, Osma Suominen <[email protected]> wrote:
> 
> Hi Andy!
> 
> On 27/10/16 12:21, Andy Seaborne wrote:
>> Shouldn't the conversion to triples check the URIs for validity?  At
>> least the N-Triples grammar rule:
>> 
>> >> [8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
>> 
>> That rule was chosen (by EricP) as a balance between full and expensive
>> URI checking and some degree of correctness with a regex or simple
>> scanning check.
> 
> ...
> 
> No really, I'm trying to understand the issue so that I can propose or even 
> fix things myself.
> ...
> Okay. I will think about this. But most likely I'll just use a separate regex 
> validation/filtering step outside Jena.


as andy noted, the IRIREF rule is well suited to use as a regex, and sed is 
quite capable of applying it to line-oriented content such as n-triples.
if the statement constituency does not matter (that is, no literal subjects), 
then that should hold for any text encoding.
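
as a rough illustration of such a filter (in python rather than sed), the 
sketch below transcribes the IRIREF rule quoted above into a regex and keeps 
only those lines which match as a whole statement. the blank node and literal 
patterns are simplified stand-ins of my own, not the full n-triples grammar, 
and the name split_statements is hypothetical.

```python
import re

# N-Triples IRIREF: '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
IRIREF = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>'

# Simplified stand-ins (assumptions, not the complete grammar).
BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
LITERAL = (r'"(?:[^"\\\r\n]|\\.)*"'
           + r'(?:\^\^' + IRIREF + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')

# subject predicate object '.', with an optional trailing comment
TRIPLE = re.compile(
    r'[ \t]*(?:' + IRIREF + r'|' + BNODE + r')'
    + r'[ \t]*' + IRIREF
    + r'[ \t]*(?:' + IRIREF + r'|' + BNODE + r'|' + LITERAL + r')'
    + r'[ \t]*\.[ \t]*(?:#.*)?'
)


def split_statements(lines):
    """Split lines into (kept, rejected) according to the TRIPLE pattern."""
    kept, rejected = [], []
    for line in lines:
        text = line.rstrip('\r\n')
        stripped = text.strip()
        # Blank lines and comment-only lines are passed through unchanged.
        if not stripped or stripped.startswith('#') or TRIPLE.fullmatch(text):
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected
```

because the literal pattern is matched as a unit, angle brackets inside a 
literal do not cause a statement to be rejected; only terms in iri position 
are held to the IRIREF rule.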

taken as the canonical criterion, it eliminates misgivings about "[missing] an 
edge case somewhere" while offering further advantages if this is not a casual 
application:
- it sets a low implementation threshold for tooling which will operate on 
effectively unlimited datasets,
- it can produce diffs which record and report deficiencies in the initial 
data,
- it lets you repeat the result with less resource expenditure, and even
- it lets you reflect on the transformation in order to produce purposeful 
corrections which improve the result quality.
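
the diff in the second point could be produced along these lines. this is 
only a sketch with python's standard difflib; the function name 
deficiency_report and the file name data.nt are hypothetical, and the 
filtered list is assumed to come from whatever validation step is in use.

```python
import difflib


def deficiency_report(original, filtered, name="data.nt"):
    """Return a unified diff, as a list of lines, from original to filtered.

    Both arguments are lists of statement lines without trailing newlines.
    Statements dropped by the filter show up as lines prefixed with '-'.
    """
    return list(difflib.unified_diff(
        original, filtered,
        fromfile=name, tofile=name + ".filtered",
        n=0,          # no context lines: report only what changed
        lineterm="",  # input lines carry no newline, so add none
    ))
```

kept alongside the data, such a report records which statements were 
rejected, and it can serve as the starting point for corrections rather than 
plain removal.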

while all of these would be possible to achieve through a process which is 
integrated into the parser, the effort would likely be greater.
if this is an ongoing production case, an independent transformation stage 
could even put you in a position to reconcile later documents through other 
channels, e.g. sparql queries, which otherwise could end up divorced from their 
intended target terms.

best regards, from berlin,
---
james anderson | [email protected] | http://dydra.com
