good afternoon;

> On 2016-10-27, at 11:46, Osma Suominen <[email protected]> wrote:
>
> Hi Andy!
>
> On 27/10/16 12:21, Andy Seaborne wrote:
>> Shouldn't the conversion to triples check the URIs for validity? At
>> least the N-Triples grammar rule:
>>
>> [8] IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
>>
>> That rule was chosen (by EricP) as a balance between full and expensive
>> URI checking and some degree of correctness with a regex or simple
>> scanning check.
>
> ...
>
> No really, I'm trying to understand the issue so that I can propose or even
> fix things myself.
> ...
> Okay. I will think about this. But most likely I'll just use a separate regex
> validation/filtering step outside Jena.
as andy noted, the iriref syntax is well suited to regex use, and sed is well capable of applying it to good effect on line-oriented content. if the statement constituency does not matter (that is, no literal subjects), then that should be true for any text encoding.

taken as the canonical criterion, it eliminates misgivings about "[missing] an edge case somewhere" while offering the advantage that, if this is not a casual application,
- you have a low implementation threshold for tooling which will operate on effectively unlimited datasets,
- you can produce diffs which allow one to record and report on deficiencies in the initial data,
- you can repeat the result with less resource expenditure, and even
- you can reflect on the transformation in order to produce purposeful corrections which improve the result quality.

while all of these would be possible to achieve through a process which is integrated into the parser, the effort would likely be greater. if this is an ongoing production case, an independent transformation stage could even put you in a position to reconcile later documents through other channels, e.g. sparql queries, which otherwise could end up divorced from their intended target terms.

best regards, from berlin,
---
james anderson | [email protected] | http://dydra.com
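
p.s. a minimal sketch of such a stand-alone filter, in python rather than sed, in case it helps to see the shape of it. the IRIREF and UCHAR patterns follow the n-triples grammar rule quoted above. everything else is an assumption for illustration: the blank node and literal patterns are simplified stand-ins, looser than the full grammar, the names (STATEMENT, partition) are hypothetical, and a statement followed by a trailing comment would be rejected.

```python
import re

# IRIREF and UCHAR follow the N-Triples grammar rule quoted above.
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
IRIREF = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>'

# Assumption: simplified stand-ins, looser than the full grammar.
BNODE = r'_:[^\s<>"]+'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"'
           + r'(?:\^\^' + IRIREF + r'|@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*)?')

# One statement per line: subject, predicate, object, terminating dot.
STATEMENT = re.compile(
    r'\s*(?:' + IRIREF + r'|' + BNODE + r')'
    + r'\s*' + IRIREF
    + r'\s*(?:' + IRIREF + r'|' + BNODE + r'|' + LITERAL + r')'
    + r'\s*\.\s*'
)


def partition(lines):
    """Split lines into (accepted, rejected).

    accepted holds the line texts that pass; rejected holds
    (line_number, text) pairs so that deficiencies in the initial
    data can be recorded and reported.
    """
    accepted, rejected = [], []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip('\r\n')
        stripped = text.strip()
        # Blank lines and whole-line comments pass through unchanged.
        if not stripped or stripped.startswith('#') or STATEMENT.fullmatch(text):
            accepted.append(text)
        else:
            rejected.append((number, text))
    return accepted, rejected
```

calling partition on an open file (or on sys.stdin) yields the lines to pass on to the parser and, separately, the numbered rejects. writing the rejects to their own file gives a record which can be diffed between runs and used as the starting point for corrections.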
