On 27/10/16 08:06, Osma Suominen wrote:
Hi Andy!

You're right - these problems should be fixed, preferably at the source
(in my case, the bad MARC records). And I will try to do that. But I'm
setting up a conversion pipeline [1]

Shouldn't the conversion to triples check the URIs for validity? At least against the N-Triples grammar rule:

>> [8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

That rule was chosen (by EricP) as a balance between full, expensive URI checking and the degree of correctness you can get with a regex or a simple scanning check.

to be run periodically, and I want
that to be robust, so that small errors like this do not cause big
problems later on. Even if I fix the current problems, one day someone
will introduce a new bad URI into a MARC record. It is better to simply
drop a single bad triple instead of losing 50k triples from the same batch.

Having bad URIs in the database is, in my experience, a big problem. They are hard to find and fix once they are in a database (the best way I know of is to dump the database to N-Quads and fix the text). Usually, the first report is when users of the system notice issues some time later.
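For a TDB store, that route is roughly the following (the store locations and the sed expression are only placeholders - the actual textual repair depends on the data):

  tdbdump --loc=DB > dump.nq
  sed 's/ |q PDF//' dump.nq > fixed.nq    # placeholder fix - substitute whatever repair is actually needed
  tdbloader --loc=DB2 fixed.nq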

What does your pipeline do about IRI warnings? Or other broken URIs?

I was surprised that riot didn't help here, particularly since it has
the --nocheck option, and --stop is not the default mode of operation.

That's open source for you.

I could use unix tools like grep, awk and/or sed to check for bad URIs
and fix or filter them on the fly, but it's nontrivial

It is one line to grep for spaces in URIs, with the bonus that you can write those lines to a separate file for accurate reporting of the problems.
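A minimal sketch (file names are assumptions, and the pattern is only a heuristic - it flags a space anywhere inside <...>, so a literal containing '<' could trip it up):

  grep -E  '<[^>]* [^>]*>' input.nt > bad-uris.nt    # lines with a space inside an IRI
  grep -vE '<[^>]* [^>]*>' input.nt > cleaned.nt     # everything else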

It does not need to be "either/or" - one stage of the pipeline can check the data (there are other useful checks, such as that every line ends in a DOT), and a later stage parses it to get the rest of the checking. Not all the checking has to be bundled into one stage.
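For example, a cheap pre-parse stage might look like this (file names follow the sketch above; blank lines and # comment lines are allowed in N-Triples, so they are screened out first):

  # stage 1: every remaining line should end in a DOT
  grep -vE '^[[:space:]]*(#|$)' cleaned.nt | grep -vE '\.[[:space:]]*$' > no-final-dot.txt

  # stage 2: let the parser do the full syntax checking as a separate stage
  riot --validate cleaned.nt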

- I might miss an
edge case somewhere. I thought it would be better if I could use the
same tool that already validates URIs/IRIs to also reject the bad triples.
What is --nocheck in riot supposed to do, if it has no effect in this case?

Unfortunately, this is a low-level syntax (tokenization) issue. I will put in some code that can be used to change this one case (I'll prepare the PR in a few minutes; the code exists because I was investigating this yesterday while doing some maintenance), but you'll encounter other problems as well.

* <http://example/<<<<>>>>>>
* Bad Unicode sequences. Quite nasty, as reporting the line number is unreliable if the Java conversion to Unicode has already been done. JavaCC has this problem as well.
* Stray newlines: literals and URIs.
   <http://example/abc
def> .

"I forgot the
triple quotes"

and these are harder to have any recovery policy for. There is a real performance/functionality tradeoff here: being able to skip bad data (error recovery) is at odds with fast tokenizing and input caching.

The --skip option seems to be half-implemented - do you (or anyone else) know why?

I am a lazy, good-for-nothing programmer.

I can try to patch up the code if it's obvious what should be done.
Right now I'm a bit confused about how the options are supposed to work
and whether there's a bug somewhere, or just a missing feature.

The best approach is to add a new parser for N-Triples (which is not at all hard - N-Triples is so simple) that can do recovery, reporting, and splitting of the output between good and bad. The current parser can't output to different places. It should be easy to register the new one as a replacement for the standard parser.

What is important is not to lose the current N-Triples parsing rate - it is critical.

The hot spots are:

* Bytes-to-characters conversion

This uses large buffers, a stripped-down buffered reader and the JDK InputStreamReader.

* Tokenization

TokenizerText is written "C style" as much as possible so that the JIT can optimize it.

* The N-Triples parse loop (which is less than 25 lines!)

If you can reuse the tokenizer, a LangNTriplesSkipping (see the PR for the tokenizer changes) would be the place to start.

Even the difference in parsing speed between parsing the same data as Turtle and as N-Triples is quite pronounced. This seems common in all RDF toolkits - some kind of CPU cache effect.

N-triples is the fastest text parser - I get 200K+ triples/s [*]

Doing a recovering parser for Turtle is harder because there are more kinds of syntax mistakes. "Skip to DOT" would be less reliable and would often skip more than one triple.

        Andy

[*] It is not the fastest parser overall - that is RDF/Thrift, which is a binary format, where I get 500K+ triples/s pure parsing speed. The tokenizer for the text formats (NT, TTL, NQ, TriG) seems to top out at 1e6 tokens/s, so it is not going to reach 500K triples/s. The expensive part is scanning for delimiters; binary formats use length-encoded strings.

-Osma


On 26/10/16 14:50, Andy Seaborne wrote:
Hi Osma,

I usually treat this as an ETL cleaning problem and text-process the data - it's not just finding the duff URIs but fixing them in some way.

We could change the parser behaviour for bad URIs.  There is a reason
why it is picky though - if bad data gets into a database it is very
hard to fix it up afterwards.  Often, problems arise days/weeks/months later and may be in the interaction with other systems when query results are published.

Turtle and N-triples explicitly define a token rule (N-triples):

[8]     IRIREF     ::=     '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

whereby space is ruled out at the bottom-most level of the parsing process.

JSON-LD is a 3rd-party system: jsonld-java.

It looks to me like Jena is not checking the output from that as it creates the Jena objects, because "ParserProfileChecker" checks for triple-level problems (literals as subjects etc.) and assumes its input terms are valid.

     Andy


On 25/10/16 13:05, Osma Suominen wrote:
Hi,

I'm trying to post-process a large bibliographic data set which, among
its 30M or so triples split into 300 N-Triples files, contains a few bad
URIs. Because of the bad URIs, I run into problems when trying to use
the data, e.g. to load it into TDB or SDB. The data set is created from
MARC records using an XQuery-based conversion process [1] that isn't very
careful with URIs, so bad URIs or other errors in the original records
may be passed through and will be present in the output files.

What I'd like to do is to merge the 300 files into a single N-Triples
file, without including the triples with the bad URIs, using e.g. riot
from the command line, like this:

riot input*.nt >output.nt

But the bad URIs in the input files cause parsing errors and subsequent
triples in the same file will not be included in the output.

Here is a small example file, with a bad URI on the 2nd line:
--cut--
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .
<http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
<http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
--cut--

When parsed using the above riot command, I get this output:

14:47:45 ERROR riot                 :: [line: 2, col: 90] Bad character in IRI (space): <http://example.org/007334701.pdf[space]...>
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .

So the command outputs just the first triple (i.e. anything before the
bad URI), but omits the bad one as well as the last one which came after
the bad URI. If I have a file with 100000 triples with one having a bad
URI on line 50000, the last 50000 triples in that file are discarded.

I tried the --nocheck option but it didn't seem to make any difference,
the result is exactly the same.

Also there is the --stop option, but it would do the opposite of what I
want - I don't want to stop on the first error, but instead continue
with the parsing.

I see that ModLangParse, the class used to process command line options
in riot, has some initial support for a --skip option [2] that would
probably do what I want, i.e. omit the bad triples while preserving all
the valid ones. But that option handling code is commented out and
CmdLangParse doesn't do anything with skipOnBadTerm (the boolean field
that would be set based on that option) [3].

So how can I get rid of the few bad triples in my input files while
preserving all the good ones?

I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.

Thanks,
Osma


[1] https://github.com/lcnetdev/marc2bibframe

[2] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78

[3] https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224





