Hi Andy!

On 27/10/16 12:21, Andy Seaborne wrote:
> Shouldn't the conversion to triples check the URIs for validity?  At
> least the N-Triples grammar rule:
>
> >> [8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
>
> That rule was chosen (by EricP) as a balance between full and expensive
> URI checking and some degree of correctness with a regex or simple
> scanning check.

Probably it should, but it's a converter developed by the Library of Congress (https://github.com/lcnetdev/marc2bibframe) and the XQueries are quite big beasts already. It's not being maintained anymore and I'm reluctant to change it on my own. Instead I try to work around any issues by pre- and post-processing my data.
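
Incidentally, rule [8] above translates almost directly into a regex. A sketch of mine (plain Python, nothing from Jena or the converter):

```python
import re

# Grammar rule [8], as a Python regex: inside <...> the allowed characters
# exclude controls, space, and <>"{}|^`\ ; UCHAR covers the \uXXXX and
# \UXXXXXXXX escape forms.
IRIREF = re.compile(
    r'<(?:[^\x00-\x20<>"{}|^`\\]'
    r'|\\u[0-9A-Fa-f]{4}'
    r'|\\U[0-9A-Fa-f]{8})*>'
)

def is_iriref(token):
    """Cheap scan: does the token match the N-Triples IRIREF rule?"""
    return IRIREF.fullmatch(token) is not None
```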

> Having bad URIs in a database is, in my experience, a big problem. They
> are hard to find and fix once they are in the database (the best way I
> know is to dump the database to N-Quads and fix the text).  Usually,
> the first report comes from users of the system some time later.

Yes, I'm not planning to put the bad URIs in a database. Instead I try to get rid of them as soon as possible - either eliminating them at the source, or failing that, right after the conversion to RDF.

> What does your pipe do about IRI warnings? Or other broken URIs?

Most URIs in the data are generated in the conversion process itself, using only alphanumeric characters etc. So the problem is really only a handful of URIs (Web document URLs generally) that were incorrectly entered into the MARC records.

> That's open source for you.

Right - you get to keep both pieces when it breaks. :)

No really, I'm trying to understand the issue so that I can propose or even fix things myself.

> It is one line to grep for spaces in URIs, with the bonus that you can
> write those lines to a separate file for accurate reporting of problems.

Right, I had this in mind. But it is not enough to check for spaces, since there are other kinds of bad URIs as well - I recall seeing at least unescaped braces in there. The IRIREF regex is a good starting point, though.
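
A rough sketch of what I have in mind for that pre-filtering step (my own code, with assumed structure - it flags any <...> token containing a character the IRIREF rule forbids, not just spaces, and diverts those lines for reporting):

```python
import re

# Flag a <...> token containing any character IRIREF forbids:
# controls, space, and "{}|^`\ (so unescaped braces are caught too).
BAD_TOKEN = re.compile(r'<[^>]*[\x00-\x20"{}|^`\\][^>]*>')

def filter_lines(lines):
    """Split input lines into (good, bad) based on the cheap scan above."""
    good, bad = [], []
    for line in lines:
        (bad if BAD_TOKEN.search(line) else good).append(line)
    return good, bad
```

The bad list can then be written to a separate file, exactly as Andy suggests for the grep approach.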

> It does not need to be an "either/or" - one stage of the pipeline checks
> the data (there are other useful checks, like that all lines end in a
> DOT), then parse it to get other checking.  Not all checking has to be
> bundled into one stage.

Yes, I'm just trying to make this as efficient as possible, within reason. But certainly this can be broken up and made into a separate validation step.
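
For instance, the lines-end-in-a-DOT check Andy mentions is trivial as its own step - a sketch (mine, not Jena code):

```python
def missing_dot(lines):
    """Return 1-based numbers of non-empty lines not ending in the final '.'."""
    return [n for n, line in enumerate(lines, 1)
            if line.strip() and not line.rstrip().endswith('.')]
```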

> Unfortunately, this is a low level syntax (tokenization) issue.  I will
> put in some code that can be used to change this one case (I'll prepare
> the PR in a few minutes; the code exists because I was investigating
> this yesterday during some maintenance) but you'll encounter other
> problems as well.
>
> * <http://example/<<<<>>>>>>
> * Bad Unicode sequences. Quite nasty, as reporting the line number is
> unreliable if the Java conversion to Unicode has already been done.
> JavaCC has this problem as well.
> * Stray newlines in literals and URIs:
>
>     <http://example/abc
> def> .
>
> "I forgot the
> triple quotes"
>
> and these are harder to have any recovery policy for. There is a real
> performance/functionality tradeoff here.  To be able to skip bad data
> (error recovery) is at odds with fast tokenizing and input caching.

Very good examples!
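
The stray-newline cases at least seem catchable with a line-level heuristic - a sketch of my own (it flags the opening line of a broken IRI or literal, though not a continuation line like "def> ."):

```python
def has_unterminated_token(line):
    """Heuristic: an IRI or literal opened on this line but never closed."""
    in_iri = in_lit = False
    i = 0
    while i < len(line):
        c = line[i]
        if in_lit:
            if c == '\\':
                i += 1          # skip the escaped character in a literal
            elif c == '"':
                in_lit = False
        elif in_iri:
            if c == '>':
                in_iri = False
        elif c == '<':
            in_iri = True
        elif c == '"':
            in_lit = True
        i += 1
    return in_iri or in_lit
```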

>> The --skip option seems to be half-implemented, do you (or anyone else)
>> know why?

> I am a lazy, good-for-nothing programmer.

Oh, really :) I think the above already explains why this hasn't been implemented. I just happened to notice that someone had at least thought of a --skip option, even though it wasn't really implemented - that's why I asked.

> The best approach is to add a new parser for N-Triples (which is not at
> all hard - N-Triples is so simple) which can do recovery, reporting and
> splitting the output between good and bad.  The current parser can't
> output to different places.  It should be easy to register it as a
> replacement for the standard one.

Okay. I will think about this. But most likely I'll just use a separate regex validation/filtering step outside Jena.
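
If I do go the parser route, a minimal recovering reader might look like this - a Python sketch only (Jena's would of course be Java), handling just IRIs and plain literals, no blank nodes, language tags or datatypes:

```python
import re

IRI = r'<[^\x00-\x20<>"{}|^`\\]*>'
LIT = r'"(?:[^"\\]|\\.)*"'
TRIPLE = re.compile(rf'^\s*({IRI})\s+({IRI})\s+({IRI}|{LIT})\s*\.\s*$')

def parse_with_recovery(lines):
    """Emit parsed triples; divert unparsable lines with their numbers."""
    good, bad = [], []
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue
        m = TRIPLE.match(line)
        if m:
            good.append(m.groups())
        else:
            bad.append((n, line))
    return good, bad
```

This gives exactly the good/bad split Andy describes, with line numbers for the report.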

-Osma


--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
