Skipping bad data in parsers tends to be a non-trivial problem, particularly
with more complex formats. Most parsers, whether hand-written or generated,
rely on tokenising the input stream into discrete recognisable tokens, using
the grammar rules to decide what kind of token is expected next. When you
hit a bad token you then need to recover somehow. In practice this usually
means discarding tokens and/or input until you reach a point where you can
safely restart parsing. For N-Triples this is relatively easy since you can
simply read to the next newline.
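
Since N-Triples is line-based, you can make this concrete by parsing each
line on its own and dropping the lines that fail. A rough sketch against
the Jena API (the class name NtFilter and the error reporting are mine;
this is an illustration, not a supported tool):

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RiotException;

    import java.io.BufferedReader;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Copy an N-Triples file to stdout, skipping any line that does not
    // parse as valid N-Triples by itself.
    public class NtFilter {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(
                    Paths.get(args[0]), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    String trimmed = line.trim();
                    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                        System.out.println(line);  // blanks/comments pass through
                        continue;
                    }
                    Model m = ModelFactory.createDefaultModel();
                    try {
                        RDFDataMgr.read(m, new ByteArrayInputStream(
                                line.getBytes(StandardCharsets.UTF_8)), Lang.NTRIPLES);
                        System.out.println(line);  // parsed cleanly: keep it
                    } catch (RiotException e) {
                        System.err.println("Skipping bad line: " + line);
                    }
                }
            }
        }
    }

Starting a fresh parser per line is slow as a general strategy but fine for
one-off cleaning; note that riot's default error handler will also log each
failure.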

However, many other formats make it difficult, or even impossible, to
successfully recover from errors, particularly formats with global state,
e.g. prefix mappings, because if you skip over a section of invalid data
that would have changed the global state, your interpretation of the rest
of the data might be completely incorrect.
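
For example, in Turtle (an illustrative snippet, not anyone's real data),
skipping a bad region that happens to contain a @prefix directive silently
changes the meaning of everything that follows it:

    @prefix ex: <http://example.org/old/> .
    # ...invalid section skipped here, which included the line:
    # @prefix ex: <http://example.org/new/> .
    ex:item1 ex:title "resolved against old/, not new/" .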

Rob

On 27/10/2016 08:06, "Osma Suominen" <[email protected]> wrote:

    Hi Andy!
    
    You're right - these problems should be fixed, preferably at the source 
    (in my case, the bad MARC records). And I will try to do that. But I'm 
    setting up a conversion pipeline [1] to be run periodically, and I want 
    that to be robust, so that small errors like this do not cause big 
    problems later on. Even if I fix the current problems, one day someone 
    will introduce a new bad URI into a MARC record. It is better to simply 
    drop a single bad triple instead of losing 50k triples from the same batch.
    
    I was surprised that riot didn't help here, particularly since it has 
    the --nocheck option, and --stop is not the default mode of operation.
    
    I could use unix tools like grep, awk and/or sed to check for bad URIs 
    and fix or filter them on the fly, but it's nontrivial - I might miss an 
    edge case somewhere. I thought it would be better if I could use the 
    same tool that already validates URIs/IRIs to also reject the bad triples.
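
    For instance, a first attempt might be a hypothetical one-liner like
    
        grep -vE '<[^>]* [^>]*>' input.nt > output.nt
    
    which would drop any line with a space between angle brackets - but that
    already shows the edge-case problem, since it would also match something
    like "<a b>" occurring inside a literal.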
    
    What is --nocheck in riot supposed to do, if it has no effect in this case?
    
    The --skip option seems to be half-implemented; do you (or anyone else)
    know why?
    
    I can try to patch up the code if it's obvious what should be done. 
    Right now I'm a bit confused about how the options are supposed to work 
    and whether there's a bug somewhere, or just a missing feature.
    
    -Osma
    
    
    On 26/10/16 14:50, Andy Seaborne wrote:
    > Hi Osma,
    >
    > I usually treat this as an ETL cleaning problem and text-process - it's
    > not just finding the duff URIs but fixing them in some way.
    >
    > We could change the parser behaviour for bad URIs.  There is a reason
    > why it is picky though - if bad data gets into a database it is very
    > hard to fix it up afterwards.  Often, problems arise days/weeks/months
    > later and may be in the interaction with other systems when query
    > results are published.
    >
    > Turtle and N-triples explicitly define a token rule (N-triples):
    >
    > [8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
    >
    > whereby space is ruled out at the bottom-most level of the parsing
    > process.
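    >
    > Rendered as a Java regex (my own rough translation for illustration -
    > not what the tokenizer actually uses):
    >
    >     import java.util.regex.Pattern;
    >
    >     // '<', then any mix of allowed characters and \uXXXX / \UXXXXXXXX
    >     // escapes (UCHAR), then '>'. Control characters, space, and
    >     // <>"{}|^`\ are excluded, so a space fails at the token level.
    >     static final Pattern IRIREF = Pattern.compile(
    >         "<(?:[^\\x00-\\x20<>\"{}|^`\\\\]|\\\\u[0-9A-Fa-f]{4}|\\\\U[0-9A-Fa-f]{8})*>");
    >
    >     // IRIREF.matcher("<http://example.org/a b>").matches() -> false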
    >
    > JSON-LD support is a 3rd-party system: jsonld-java.
    >
    > Looks to me like Jena is not checking the output from that as it creates
    > the Jena objects because "ParserProfileChecker" is checking for triple
    > problems (literals as subjects etc.) and assumes its input terms are
    > valid.
    >
    >      Andy
    >
    >
    > On 25/10/16 13:05, Osma Suominen wrote:
    >> Hi,
    >>
    >> I'm trying to post-process a large bibliographic data set which, among
    >> its 30M or so triples split into 300 N-Triples files, contains a few bad
    >> URIs. Because of the bad URIs, I run into problems when trying to use
    >> the data, e.g. to load it into TDB or SDB. The data set is created from
    >> MARC records using an XQuery-based conversion process [1] that isn't very
    >> careful with URIs, so bad URIs or other errors in the original records
    >> may be passed through and will be present in the output files.
    >>
    >> What I'd like to do is to merge the 300 files into a single N-Triples
    >> file, without including the triples with the bad URIs, using e.g. riot
    >> from the command line, like this:
    >>
    >> riot input*.nt >output.nt
    >>
    >> But the bad URIs in the input files cause parsing errors and subsequent
    >> triples in the same file will not be included in the output.
    >>
    >> Here is a small example file, with a bad URI on the 2nd line:
    >> --cut--
    >> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
    >> <http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
    >> <http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
    >> --cut--
    >>
    >> When parsed using the above riot command, I get this output:
    >>
    >> 14:47:45 ERROR riot                 :: [line: 2, col: 90] Bad character in IRI (space): <http://example.org/007334701.pdf[space]...>
    >> <http://example.org/007334701> <http://schema.org/name> "example bad URL" .
    >>
    >> So the command outputs just the first triple (i.e. anything before the
    >> bad URI), but omits the bad one as well as the last one which came after
    >> the bad URI. If I have a file with 100000 triples with one having a bad
    >> URI on line 50000, the last 50000 triples in that file are discarded.
    >>
    >> I tried the --nocheck option but it didn't seem to make any difference,
    >> the result is exactly the same.
    >>
    >> Also there is the --stop option, but it would do the opposite of what I
    >> want - I don't want to stop on the first error, but instead continue
    >> with the parsing.
    >>
    >> I see that ModLangParse, the class used to process command line options
    >> in riot, has some initial support for a --skip option [2] that would
    >> probably do what I want, i.e. omit the bad triples while preserving all
    >> the valid ones. But that option handling code is commented out and
    >> CmdLangParse doesn't do anything with skipOnBadTerm (the boolean field
    >> that would be set based on that option) [3].
    >>
    >> So how can I get rid of the few bad triples in my input files while
    >> preserving all the good ones?
    >>
    >> I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.
    >>
    >> Thanks,
    >> Osma
    >>
    >>
    >> [1] https://github.com/lcnetdev/marc2bibframe
    >>
    >> [2]
    >> https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78
    >>
    >> [3]
    >> https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224
    >>
    
    
    -- 
    Osma Suominen
    D.Sc. (Tech), Information Systems Specialist
    National Library of Finland
    P.O. Box 26 (Kaikukatu 4)
    00014 HELSINGIN YLIOPISTO
    Tel. +358 50 3199529
    [email protected]
    http://www.nationallibrary.fi
    



