Hi all,
I wrote a little Python script to do the N-Triples parsing/validation
using a regex, as suggested:
https://github.com/NatLibFi/bib-rdf-pipeline/blob/master/scripts/filter-bad-ntriples.py
It doesn't check absolutely everything (e.g. the format of language
tags or datatypes), but it's enough for what I need right now.
The reason I didn't use grep was that I want to both pass through the
valid triples (on stdout) and report the bad ones (on stderr). Maybe sed
could do it too, but this was easier for me.
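In case it helps anyone else, the rough shape of such a filter is below. This is only a sketch, not the script linked above, and the term patterns are simplified compared to the full N-Triples grammar (blank node labels, literals and language tags in particular are looser than the spec):

#!/usr/bin/env python3
# Rough sketch only (not the script linked above): pass lines matching a
# simplified N-Triples pattern to stdout, report everything else on stderr.
import re
import sys

UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
# IRIREF as in N-Triples grammar rule [8]
IRI = r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>'
BNODE = r'_:\S+'                                      # simplified blank node label
# simplified literal: quoted string with optional datatype or language tag
LITERAL = r'"(?:[^"\\\n\r]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z][A-Za-z0-9-]*)?'
SUBJECT = r'(?:' + IRI + r'|' + BNODE + r')'
OBJECT = r'(?:' + IRI + r'|' + BNODE + r'|' + LITERAL + r')'
TRIPLE = re.compile(
    r'^\s*' + SUBJECT + r'\s+' + IRI + r'\s+' + OBJECT + r'\s*\.\s*$')

for line in sys.stdin:
    stripped = line.strip()
    if not stripped or stripped.startswith('#') or TRIPLE.match(line):
        sys.stdout.write(line)    # valid triple (or blank/comment line)
    else:
        sys.stderr.write(line)    # bad triple, report it

Run as e.g. "python3 filter.py < input.nt > good.nt 2> bad.nt".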
Thanks for all the advice!
-Osma
PS. I also found a tool called "reshaperdf" which has a "correct"
command that does a very similar operation - fixing some bad triples and
reporting others. It only checks for spaces in URIs but not e.g. braces,
so it wasn't useful to me without modifications.
https://github.com/linked-swissbib/reshaperdf/blob/master/src/main/java/org/gesis/reshaperdf/cmd/correct/CorrectCommand.java
On 27/10/16 13:24, james anderson wrote:
good afternoon;
On 2016-10-27, at 11:46, Osma Suominen <[email protected]> wrote:
Hi Andy!
On 27/10/16 12:21, Andy Seaborne wrote:
Shouldn't the conversion to triples check the URIs for validity? At
least the N-Triples grammar rule:
[8] IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
That rule was chosen (by EricP) as a balance between full and expensive
URI checking and some degree of correctness with a regex or simple
scanning check.
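As an aside, that rule translates to a regular expression almost verbatim. A rough Python rendering (mine, for illustration, not anything taken from Jena) would be:

import re

# Illustrative translation of grammar rule [8] (IRIREF); the UCHAR
# alternatives cover the \uXXXX and \UXXXXXXXX escape forms.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]'    # any character the rule does not forbid
    r'|\\u[0-9A-Fa-f]{4}'            # UCHAR, short form
    r'|\\U[0-9A-Fa-f]{8})*>'         # UCHAR, long form
)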
...
No really, I'm trying to understand the issue so that I can propose or even fix
things myself.
...
Okay. I will think about this. But most likely I'll just use a separate regex
validation/filtering step outside Jena.
as andy noted, the iriref syntax is well suited to regex use, and sed is well
capable of applying such a regex effectively to line-oriented content.
if the statement constituency does not matter (that is, no literal subjects),
then that should be true for any text encoding.
taken as the canonical criterion, it eliminates misgivings about "[missing] an
edge case somewhere" while offering the advantages that, if this is not a casual
application,
- you have a low implementation threshold for tooling which will operate on
effectively unlimited datasets,
- you can produce diffs to record and report on deficiencies in the initial
data,
- you can repeat the result with less resource expenditure, and even
- you can reflect on the transformation in order to produce purposeful
corrections that improve the result quality.
while all of these would be possible to achieve through a process which is
integrated into the parser, the effort would likely be greater.
if this is an ongoing production case, an independent transformation stage
could even put you in a position to reconcile later documents through other
channels - e.g. sparql queries - which otherwise could end up divorced from
their intended target terms.
best regards, from berlin,
---
james anderson | [email protected] | http://dydra.com
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi