Hi Osma,

What a coincidence: today I ran into the same problem here. I have (many 
large) JSON-LD files with a few messy URIs like this:

{
   "@context" :
      {
        "dcterms": "http://purl.org/dc/terms/",
        "eb": "http://zbw.eu/beta/resource/title/",
        "gnd": "http://d-nb.info/gnd/";,
        "subject_gnd": { "@id": "dcterms:subject", "@type": "@id" }
      },
   "@graph" : [
      {
         "subject_gnd" : [
            "gnd:4114557-4 4070699-0",
            "gnd:4114247-0"
         ],
         "@id" : "eb:10010237512"
      }
   ]
}

riot produces a warning and two triples:

# riot --check --strict /tmp/example.jsonld
20:00:26 WARN  riot :: Bad IRI: <http://d-nb.info/gnd/4114557-4 4070699-0>
Code: 17/WHITESPACE in PATH: A single whitespace character. These match no
grammar rules of URIs/IRIs. These characters are permitted in RDF URI
References, XML system identifiers, and XML Schema anyURIs.
<http://zbw.eu/beta/resource/title/10010237512> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4114557-4 4070699-0> .
<http://zbw.eu/beta/resource/title/10010237512> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4114247-0> .

Strangely, I can load the .jsonld file via tdbloader (resulting in exactly the 
two triples shown above in the TDB). Loading the equivalent .nt file, however, 
aborts with an exception:

ERROR [line: 1, col: 116] Bad character in IRI (space): 
<http://d-nb.info/gnd/4114557-4[space]...>
org.apache.jena.riot.RiotException: [line: 1, col: 116] Bad character in IRI 
(space): <http://d-nb.info/gnd/4114557-4[space]...>

Neither of these behaviors is very helpful. A --skip option that consistently 
skips the bad triples and outputs or loads the good ones would be great. Or 
perhaps somebody has another idea for getting rid of the bad URIs?
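In the meantime, for the N-Triples case a crude pre-filter can at least keep 
the good triples flowing. The sketch below (my own workaround, not part of 
Jena) drops any line containing an IRI reference with whitespace inside the 
angle brackets, which is exactly the Code: 17/WHITESPACE case riot warns 
about. It is a line-based heuristic, not a real N-Triples parser, so treat it 
as a sketch only:

```python
import re
import sys

# Matches any <...> IRI reference that contains whitespace -- the case
# riot reports as "Code: 17/WHITESPACE". Line-based heuristic for
# N-Triples only; it does not fully parse the syntax, so literals that
# happen to contain angle brackets could confuse it.
BAD_IRI = re.compile(r'<[^<>]*\s[^<>]*>')

def filter_ntriples(lines):
    """Yield only the lines whose IRIs contain no whitespace."""
    for line in lines:
        if BAD_IRI.search(line):
            sys.stderr.write("skipping: " + line)
        else:
            yield line

if __name__ == "__main__":
    for good in filter_ntriples(sys.stdin):
        sys.stdout.write(good)
```

Used as `python filter.py < input.nt > clean.nt`, the skipped lines go to 
stderr so they can still be inspected afterwards. It obviously doesn't help 
with the JSON-LD input, where the bad IRIs only appear after prefix expansion.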

Cheers, Joachim


> -----Original Message-----
> From: Osma Suominen [mailto:[email protected]]
> Sent: Tuesday, 25 October 2016 14:05
> To: [email protected]
> Subject: Getting rid of triples with bad URIs
> 
> Hi,
> 
> I'm trying to post-process a large bibliographic data set which, among its
> 30M or so triples split into 300 N-Triples files, contains a few bad URIs.
> Because of the bad URIs, I run into problems when trying to use the data,
> e.g. to load it into TDB or SDB. The data set is created from MARC records
> using an XQuery-based conversion process [1] that isn't very careful with
> URIs, so bad URIs or other errors in the original records may be passed
> through and will be present in the output files.
> 
> What I'd like to do is to merge the 300 files into a single N-Triples
> file, without including the triples with the bad URIs, using e.g. riot
> from the command line, like this:
> 
> riot input*.nt >output.nt
> 
> But the bad URIs in the input files cause parsing errors and subsequent 
> triples
> in the same file will not be included in the output.
> 
> Here is a small example file, with a bad URI on the 2nd line:
> --cut--
> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
> <http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
> <http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
> --cut--
> 
> When parsed using the above riot command, I get this output:
> 
> 14:47:45 ERROR riot :: [line: 2, col: 90] Bad character in IRI (space):
> <http://example.org/007334701.pdf[space]...>
> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
> 
> So the command outputs just the first triple (i.e. anything before the bad
> URI), but omits the bad one as well as the last one, which came after the
> bad URI. If I have a file with 100000 triples with one having a bad URI on
> line 50000, the last 50000 triples in that file are discarded.
> 
> I tried the --nocheck option, but it didn't seem to make any difference;
> the result is exactly the same.
> 
> There is also the --stop option, but it would do the opposite of what I
> want: I don't want to stop at the first error, but rather continue parsing.
> 
> I see that ModLangParse, the class used to process command line options in
> riot, has some initial support for a --skip option [2] that would probably
> do what I want, i.e. omit the bad triples while preserving all the valid
> ones. But that option handling code is commented out, and CmdLangParse
> doesn't do anything with skipOnBadTerm (the boolean field that would be
> set based on that option) [3].
> 
> So how can I get rid of the few bad triples in my input files while
> preserving all the good ones?
> 
> I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.
> 
> Thanks,
> Osma
> 
> 
> [1] https://github.com/lcnetdev/marc2bibframe
> 
> [2] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78
> 
> [3] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224
> 
> --
> Osma Suominen
> D.Sc. (Tech), Information Systems Specialist
> National Library of Finland
> P.O. Box 26 (Kaikukatu 4)
> 00014 HELSINGIN YLIOPISTO
> Tel. +358 50 3199529
> [email protected]
> http://www.nationallibrary.fi