Skipping bad data in parsers tends to be a non-trivial problem, particularly with more complex formats. Most parsers, whether hand-written or generated, operate by tokenising the input stream into discrete, recognisable tokens, using the grammar rules to decide what kind of token is expected next. In the event that you hit a bad token you then need to recover somehow. In practice this usually means discarding tokens and/or input until you reach a point where you can safely restart parsing. For N-Triples this is relatively easy, since you can simply read to the next newline.
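As a rough illustration of that line-based recovery (not from the original thread): a minimal Java sketch, assuming Jena is on the classpath, that copies an N-Triples stream and drops any line that fails to parse on its own. The class name NTriplesLineFilter is made up.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RiotException;

/** Illustrative filter: copy N-Triples from stdin to stdout, dropping unparsable lines. */
public class NTriplesLineFilter {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            // Parse each line on its own as a tiny one-triple N-Triples document.
            Model scratch = ModelFactory.createDefaultModel();
            try {
                RDFDataMgr.read(scratch,
                        new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8)),
                        Lang.NTRIPLES);
                System.out.println(line);                        // good line: pass through verbatim
            } catch (RiotException e) {
                System.err.println("Skipped bad line: " + line); // recovery = skip to next newline
            }
        }
    }
}

Because each line is its own document, one bad line costs only that line - roughly the behaviour Osma is asking the --skip option to provide below.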
However, with many other formats it is difficult to impossible to successfully recover from errors, particularly in the case of formats with global state, e.g. prefix mappings: if you skip over a section of invalid data that would have changed the global state, your interpretation of the rest of the data might be completely incorrect (see the short Turtle sketch at the end of this message).

Rob

On 27/10/2016 08:06, "Osma Suominen" <[email protected]> wrote:

Hi Andy!

You're right - these problems should be fixed, preferably at the source (in my case, the bad MARC records). And I will try to do that. But I'm setting up a conversion pipeline [1] to be run periodically, and I want that to be robust, so that small errors like this do not cause big problems later on. Even if I fix the current problems, one day someone will introduce a new bad URI into a MARC record. It is better to simply drop a single bad triple instead of losing 50k triples from the same batch.

I was surprised that riot didn't help here, particularly since it has the --nocheck option, and --stop is not the default mode of operation. I could use unix tools like grep, awk and/or sed to check for bad URIs and fix or filter them on the fly, but it's nontrivial - I might miss an edge case somewhere. I thought it would be better if I could use the same tool that already validates URIs/IRIs to also reject the bad triples.

What is --nocheck in riot supposed to do, if it has no effect in this case? The --skip option seems to be half-implemented - do you (or anyone else) know why? I can try to patch up the code if it's obvious what should be done. Right now I'm a bit confused about how the options are supposed to work and whether there's a bug somewhere, or just a missing feature.

-Osma

On 26/10/16 14:50, Andy Seaborne wrote:
> Hi Osma,
>
> I usually treat this as an ETL cleaning problem and text-process - it's
> not just finding the duff URIs but fixing them in some way.
>
> We could change the parser behaviour for bad URIs. There is a reason
> why it is picky though - if bad data gets into a database it is very
> hard to fix it up afterwards. Often, problems arise days/weeks/months
> later, and may be in the interaction with other systems when query
> results are published.
>
> Turtle and N-Triples explicitly define a token rule (N-Triples):
>
> [8] IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
>
> whereby space is ruled out at the bottom-most level of the parsing process.
>
> JSON-LD is a 3rd party system: jsonld-java.
>
> Looks to me like Jena is not checking the output from that as it creates
> the Jena objects, because "ParserProfileChecker" is checking for triple
> problems (literals as subjects etc.) and assumes its input terms are valid.
>
> Andy
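To see why the space is caught so early, rule [8] can be rendered directly as a regular expression. Below is a minimal Java sketch of that token rule - purely illustrative, not Jena's actual tokenizer, and the class name IriRefCheck is made up:

import java.util.regex.Pattern;

/** Illustrative only: the N-Triples IRIREF token rule [8] as a Java regex. */
public class IriRefCheck {
    // IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
    static final Pattern IRIREF = Pattern.compile(
            "<(?:[^\\x00-\\x20<>\"{}|^`\\\\]"   // anything except the excluded chars (incl. space)
          + "|\\\\u[0-9A-Fa-f]{4}"              // UCHAR: \uXXXX
          + "|\\\\U[0-9A-Fa-f]{8})*>");         // UCHAR: \UXXXXXXXX

    public static void main(String[] args) {
        System.out.println(IRIREF.matcher("<http://example.org/007334701>").matches());            // true
        System.out.println(IRIREF.matcher("<http://example.org/007334701.pdf |q PDF>").matches()); // false
    }
}

The bad IRI from the example file below fails this check because of the space, which is rejected at the token level - presumably before the term-level checking that --nocheck disables ever runs.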
> On 25/10/16 13:05, Osma Suominen wrote:
>> Hi,
>>
>> I'm trying to post-process a large bibliographic data set which, among
>> its 30M or so triples split into 300 N-Triples files, contains a few bad
>> URIs. Because of the bad URIs, I run into problems when trying to use
>> the data, e.g. to load it into TDB or SDB. The data set is created from
>> MARC records using an XQuery-based conversion process [1] that isn't very
>> careful with URIs, so bad URIs or other errors in the original records
>> may be passed through and will be present in the output files.
>>
>> What I'd like to do is to merge the 300 files into a single N-Triples
>> file, without including the triples with the bad URIs, using e.g. riot
>> from the command line, like this:
>>
>> riot input*.nt >output.nt
>>
>> But the bad URIs in the input files cause parsing errors, and subsequent
>> triples in the same file will not be included in the output.
>>
>> Here is a small example file, with a bad URI on the 2nd line:
>> --cut--
>> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
>> <http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
>> <http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
>> --cut--
>>
>> When parsed using the above riot command, I get this output:
>>
>> 14:47:45 ERROR riot :: [line: 2, col: 90] Bad character in IRI (space): <http://example.org/007334701.pdf[space]...>
>> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
>>
>> So the command outputs just the first triple (i.e. anything before the
>> bad URI), but omits the bad one as well as the last one which came after
>> the bad URI. If I have a file with 100000 triples with one having a bad
>> URI on line 50000, the last 50000 triples in that file are discarded.
>>
>> I tried the --nocheck option but it didn't seem to make any difference;
>> the result is exactly the same.
>>
>> Also there is the --stop option, but it would do the opposite of what I
>> want - I don't want to stop on the first error, but instead continue
>> with the parsing.
>>
>> I see that ModLangParse, the class used to process command line options
>> in riot, has some initial support for a --skip option [2] that would
>> probably do what I want, i.e. omit the bad triples while preserving all
>> the valid ones. But that option handling code is commented out, and
>> CmdLangParse doesn't do anything with skipOnBadTerm (the boolean field
>> that would be set based on that option) [3].
>>
>> So how can I get rid of the few bad triples in my input files while
>> preserving all the good ones?
>>
>> I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.
>>
>> Thanks,
>> Osma
>>
>> [1] https://github.com/lcnetdev/marc2bibframe
>>
>> [2] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78
>>
>> [3] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
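And the Turtle sketch referenced in Rob's message above - an illustrative example, not from the original thread, of how newline-based recovery can corrupt global state:

--cut--
@prefix ex: <http://old.example/> .
ex:a ex:p "parsed with the old mapping" .
@prefix ex: <http://new.example/> . <http://bad iri> ex:p "parse error on this line" .
ex:b ex:p "silently misinterpreted" .
--cut--

A parser that recovered by discarding everything up to the newline after the bad IRI would also discard the redefinition of ex:, so ex:b would be resolved against http://old.example/ instead of http://new.example/ - syntactically valid output with the wrong meaning.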
