Hi all,

I wrote a little Python script to do the N-Triples parsing/validation using a regex, as suggested:
https://github.com/NatLibFi/bib-rdf-pipeline/blob/master/scripts/filter-bad-ntriples.py

It doesn't check absolutely everything (e.g. the syntax of language tags or datatypes), but it's enough for what I need right now.

The reason I didn't use grep was that I wanted to both pass through the valid triples (on stdout) and report the bad ones (on stderr). Maybe sed could do it too, but this was easier for me.
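The core of it looks roughly like this (a simplified sketch, not the exact code in the repository; the IRI character class comes straight from the IRIREF production quoted below, and the bnode/literal patterns are deliberately loose):

#!/usr/bin/env python
# Simplified sketch of the filtering approach: syntactically acceptable
# triples go to stdout, rejects are reported on stderr.
import re
import sys

# Characters forbidden inside an IRI by the N-Triples IRIREF production:
# control characters, space, <>, ", {}, |, ^, backtick and backslash.
# (UCHAR escapes are not handled in this sketch.)
IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
BNODE = r'_:\S+'
# Loose literal pattern: string escapes, language tags and datatypes are
# accepted but not validated.
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'

TRIPLE = re.compile(
    r'^\s*(?:{i}|{b})\s+{i}\s+(?:{i}|{b}|{l})\s*\.\s*$'.format(
        i=IRI, b=BNODE, l=LITERAL))

for line in sys.stdin:
    if not line.strip() or line.lstrip().startswith('#') or TRIPLE.match(line):
        sys.stdout.write(line)   # pass through blank lines, comments and good triples
    else:
        sys.stderr.write(line)   # report the bad triple

It's a plain stdin/stdout filter, so it can be dropped into a pipeline along the lines of:

  zcat data.nt.gz | filter-bad-ntriples.py > good.nt 2> bad.nt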

Thanks for all the advice!

-Osma

PS. I also found a tool called "reshaperdf" which has a "correct" command that does a very similar operation - fixing some bad triples and reporting others. However, it only checks for spaces in URIs, not e.g. braces, so it wasn't useful to me without modifications.
https://github.com/linked-swissbib/reshaperdf/blob/master/src/main/java/org/gesis/reshaperdf/cmd/correct/CorrectCommand.java

On 27/10/16 13:24, james anderson wrote:
good afternoon;

On 2016-10-27, at 11:46, Osma Suominen <[email protected]> wrote:

Hi Andy!

On 27/10/16 12:21, Andy Seaborne wrote:
Shouldn't the conversion to triples check the URIs for validity?  At
least the N-Triples grammar rule:

[8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

That rule was chosen (by EricP) as a balance between full and expensive
URI checking and some degree of correctness with a regex or simple
scanning check.

...

No really, I'm trying to understand the issue so that I can propose or even fix 
things myself.
...
Okay. I will think about this. But most likely I'll just use a separate regex 
validation/filtering step outside Jena.

as andy noted, the iriref syntax is well suited to regex use and sed is well 
capable of applying it to effect for line-oriented content.
if the statement constituency does not matter (that is, no literal subjects), 
then that should be true for any text encoding.
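for instance, that production translates more or less directly into a python regex (a rough sketch; uchar here being the \uXXXX / \UXXXXXXXX numeric escape forms):

import re

# rough rendering of the n-triples IRIREF production
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

# terms which do not match the production can then be flagged line by line
assert IRIREF.fullmatch('<http://example.org/ok>')
assert not IRIREF.fullmatch('<http://example.org/not ok>')
assert not IRIREF.fullmatch('<http://example.org/bad{id}>')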

taken as the canonical criterion, it eliminates misgivings about "[missing] an 
edge case somewhere" while offering the advantage that, if this is not a casual 
application,
- you have a low implementation threshold for tooling which will operate on 
effectively unlimited datasets,
- it can produce diffs to allow one to record and report on deficiencies in the 
initial data,
- it can repeat the result with less resource expenditure, and even
- it can reflect on the transformation in order to produce purposeful corrections 
that improve the result quality.

while all of these would be possible to achieve through a process which is 
integrated into the parser, the effort would likely be greater.
if this is an ongoing production case, an independent transformation stage 
could even put you in a position to reconcile later documents through other 
channels - e.g. sparql queries, which otherwise could end up divorced from their 
intended target terms.

best regards, from berlin,
---
james anderson | [email protected] | http://dydra.com

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
