On 13/06/12 21:58, Rob Vesse wrote:
Yes, I think it is definitely tricky. I would probably suggest having a
separate tokenizer implementation rather than trying to add this
functionality onto the existing one. That way there is scope for adding
more complex error recovery at a later date, and you don't affect current
performance.
Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941 | o: 925.264.4729 | @: [email protected] | Skype: rvesse
6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588
On 6/13/12 12:52 PM, "Andy Seaborne"<[email protected]> wrote:
On 13/06/12 17:52, Rob Vesse wrote:
Actually it looks like some of this stuff is already in place. If you
take a look at LangNTriples in ARQ, you will see it derives from
LangNTuples, which has a setSkipOnBadTerms() method. I can't tell
whether this actually affects anything, i.e. whether it is actually
honored by LangNTriples, but you may want to experiment and see.
There are two ways I can see of doing it:
1/ The tokenizer itself could be modified and taught to skip at the
character level (below tokens) to find a real newline, so that aspect is
easy. So the tokenizer needs upgrading without slowing it down - tuning
the tokenizer is quite important for overall performance.
2/ If the emphasis is on the error recovery, I'd experiment with reading
in two stages - reading into the large buffer the I/O uses, then reading
out a line, then parsing the line for a triple. Error recovery is then to
throw away the working line if it can't be parsed (see the sketch below).
No real tokenizer changes, but it does an extra copy to extract the line;
that copy may not make much difference, as the data for the line is in
the CPU cache and is fast to access straight after it was extracted.
(From playing with bytes-to-UTF-8 conversion, I know an extra copy can be
faster - the Java libraries do better for large blocks than a UTF-8
decoder I wrote, and they need an extra copy; presumably the authors know
exactly what works and what doesn't in Java, even if it's not done in
native code.)
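As a rough sketch of that throw-away-the-line rule - written here as an
external filter over the Jena 2.x Model API (com.hp.hpl.jena.*) rather
than as a change inside the tokenizer, so treat it as an illustration of
the recovery rule, not the real two-stage design:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.StringReader;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    /** Line-based N-Triples filter: re-parse each line, drop lines
     *  that fail, print the rest so they can be piped to tdbloader. */
    public class CleanNTriples {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            long skipped = 0;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                if (t.isEmpty() || t.startsWith("#")) {
                    System.out.println(line);   // keep blank/comment lines
                    continue;
                }
                try {
                    // Error recovery = throw the working line away on failure.
                    Model scratch = ModelFactory.createDefaultModel();
                    scratch.read(new StringReader(line), null, "N-TRIPLE");
                    System.out.println(line);   // line parsed cleanly: keep it
                } catch (Exception e) {
                    skipped++;                  // bad triple: skip the line
                }
            }
            in.close();
            System.err.println("Skipped " + skipped + " bad line(s)");
        }
    }

Creating a scratch Model per line is wasteful next to doing the recovery
inside the parser's own buffer, but it shows the rule, and the cleaned
output can be piped straight to tdbloader.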
For Turtle, it's harder - skipping to DOT-newline is probably OK, based
on the fact that typical usage is to not have multiple blocks of triples
on one line (yes - it happens, but not much at scale).
Andy
On 6/13/12 9:17 AM, "Rob Vesse"<[email protected]> wrote:
Hi Stefan
I think the main problem here is one of error recovery. When I see
invalid data, either at the tokenizer or the parser level, what do I
actually do with it? I.e. where do I skip forward to in order to ignore
that invalid triple?
For NTriples, which is officially a line-based format, the fix would
likely be to skip to the end of the line on a tokenizing error, and to
skip to the next `.` token on a parsing error, since if we hit the error
in parsing (not tokenization) then we can assume the tokens are
syntactically valid but not semantically valid, e.g. a blank node in the
predicate position. If we were talking about other formats, sensible
error recovery might be much harder or impossible.
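A hedged sketch of that skip-to-the-next-DOT recovery; Token, Tokenizer
and parseOneTriple here are illustrative stand-ins, not ARQ's actual
classes:

    // Illustrative stand-ins only - ARQ's real tokenizer API differs.
    enum TokenType { IRI, BNODE, LITERAL, DOT }

    interface Token { TokenType type(); }

    interface Tokenizer {
        boolean hasNext();
        Token peek();   // look at the next token without consuming it
        Token next();   // consume and return the next token
    }

    class RecoveringNTriplesParser {
        void parseAll(Tokenizer tokens) {
            while (tokens.hasNext()) {
                try {
                    parseOneTriple(tokens);  // subject predicate object DOT
                } catch (RuntimeException badTriple) {
                    // Tokens were lexically valid but the triple was not,
                    // e.g. a blank node in the predicate position.
                    // Recovery: discard tokens up to and including the next DOT.
                    while (tokens.hasNext()
                            && tokens.peek().type() != TokenType.DOT) {
                        tokens.next();
                    }
                    if (tokens.hasNext()) tokens.next();  // consume the DOT
                }
            }
        }

        void parseOneTriple(Tokenizer tokens) {
            // ... check the token type allowed at each position, emit the
            // triple, require a closing DOT; throw on a bad triple ...
        }
    }

For Turtle the same skip-to-DOT loop would apply, subject to the
multiple-triples-per-line caveat discussed elsewhere in the thread.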
It's probably not that hard to write an NTriples tokenizer and parser
that do error recovery, based on the existing ones; patches are always
welcome. If I ever have some spare time I might look at this myself.
Rob
On 6/13/12 7:13 AM, "Stefan Scheffler"<[email protected]> wrote:
On 13/06/12 15:55, Andy Seaborne wrote:
On 13/06/12 14:19, Damian Steer wrote:
On 13/06/12 14:03, Stefan Scheffler wrote:
Hello, I need to import large N-Triples files (DBpedia) into a TDB.
The problem is that many of the triples are not valid (like a
missing '<' or invalid chars), leading to an exception which
quits the import... I just want to skip them and continue, so that
all valid triples are in the TDB at the end.
Is there a possibility to do that easily? I tried to rewrite
ARQ, but this is very complex. With friendly regards, Stefan
Scheffler
You'd be much better off finding an n-triple parser that kept going
and also spat out (working) n-triples for piping to TDB. I can't see
an option like that in the riot command line.
There isn't such an option - there could be (if someone wants to
contribute a patch).
This is a typical ETL situation - you're going to have to clean those
triples (which were presumably not written by an RDF tool). Do you
want to lose them or fix them?
Checking before loading is always a good idea, especially for data from
outside and from other tools. When I receive TTL or RDF/XML, I parse it
to NT, which means it's then checked. Then load the data.
Andy
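A minimal sketch of that parse-to-NT check, assuming the Jena 2.x Model
API; the parsing step is the check, and the output is N-Triples ready to
load:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    /** Parse incoming Turtle (parsing = checking), then write the
     *  checked data as N-Triples, ready for loading into TDB. */
    public class CheckToNT {
        public static void main(String[] args) throws Exception {
            Model m = ModelFactory.createDefaultModel();
            m.read(new FileInputStream(args[0]), null, "TURTLE");
            m.write(new FileOutputStream(args[1]), "N-TRIPLE");
        }
    }

An in-memory Model won't scale to DBpedia-sized dumps - for those you'd
stream the conversion - but the check-then-load principle is the same.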
Hi Andy,
At the moment I just want to skip the invalid triples (later they should
be stored and maybe fixed, if that's possible).
The main goal is to have an import process which runs automatically and
doesn't stop on every failure it finds.
The moment of checking doesn't matter (atm ;)). It can be before or
during the import (but I used the second strategy with Sesame).
Thanks, Stefan
--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: [email protected]
Thank you Rob and Andy for your solutions.
Yesterday I came to the same conclusion - to write my own tokenizer or
modify the existing one - because I don't like the idea of doing the
checking first and taking a performance loss, which will definitely be a
problem in the future.
Thank you for your fast responses.
With friendly regards
Stefan Scheffler
--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: [email protected]