Importing N-Triples into TDB without stopping at an error

2012-06-13 Thread Stefan Scheffler

Hello,
I need to import large N-Triples files (DBpedia) into a TDB. The problem
is that many of the triples are not valid (e.g. missing '<>' or invalid
characters), which leads to an exception that aborts the import. I just want
to skip them and continue, so that all valid triples are in the TDB at
the end.


Is there an easy way to do that? I tried to modify ARQ,
but it is very complex.

Kind regards
Stefan Scheffler

--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: + 49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de



Re: Importing N-Triples into TDB without stopping at an error

2012-06-13 Thread Damian Steer

On 13/06/12 14:03, Stefan Scheffler wrote:

You'd be much better off finding an N-Triples parser that keeps going
and also emits (valid) N-Triples for piping to TDB. I can't see
an option like that in the riot command line.
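Such a keep-going pre-filter can be sketched without Jena at all. The class below (the name `NTriplesLineFilter` and the regexes are mine, and the patterns only roughly approximate the N-Triples grammar) passes valid-looking lines through on stdout and drops the rest, so the output can be piped to the TDB loader:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Rough line-level pre-filter for N-Triples: pass valid-looking lines through,
// drop the rest, and report the drop count on stderr. The regexes below only
// approximate the N-Triples grammar; this is a cleaning sketch, not a validator.
public class NTriplesLineFilter {
    private static final String IRI   = "<[^<>\"{}|^`\\\\\\x00-\\x20]*>";
    private static final String BNODE = "_:[A-Za-z][A-Za-z0-9]*";
    private static final String LIT   =
        "\"(?:[^\"\\\\]|\\\\.)*\"(?:\\^\\^" + IRI + "|@[A-Za-z0-9-]+)?";
    private static final Pattern TRIPLE = Pattern.compile(
        "(?:" + IRI + "|" + BNODE + ")\\s+" + IRI
        + "\\s+(?:" + IRI + "|" + BNODE + "|" + LIT + ")\\s*\\.");

    public static boolean looksValid(String line) {
        String t = line.trim();
        if (t.isEmpty() || t.startsWith("#")) return true; // blanks/comments pass through
        return TRIPLE.matcher(t).matches();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        long dropped = 0;
        for (String line; (line = in.readLine()) != null; ) {
            if (looksValid(line)) System.out.println(line);
            else dropped++;
        }
        System.err.println("dropped " + dropped + " bad lines");
    }
}
```

Usage would be something like `java NTriplesLineFilter < dbpedia.nt > clean.nt`, then load clean.nt with tdbloader as usual.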

Damian


Re: Importing N-Triples into TDB without stopping at an error

2012-06-13 Thread Andy Seaborne

On 13/06/12 14:19, Damian Steer wrote:

You'd be much better off finding an N-Triples parser that keeps going
and also emits (valid) N-Triples for piping to TDB. I can't see
an option like that in the riot command line.


There isn't such an option - there could be (if someone wants to
contribute a patch).


This is a typical ETL situation - you're going to have to clean those
triples (which presumably were not written by an RDF tool). Do you want
to lose them or fix them?


Checking before loading is always a good idea, especially for data from
outside and from other tools. When I receive TTL or RDF/XML, I parse it to
N-Triples, which means it's then checked. Then load the data.


Andy







Re: Importing N-Triples into TDB without stopping at an error

2012-06-13 Thread Rob Vesse
Actually it looks like some of this stuff is already in place. If you
take a look at LangNTriples in ARQ you will see it derives from
LangNTuples, which has a setSkipOnBadTerms() method. I can't tell
whether this actually affects anything, i.e. whether it is actually honoured
by LangNTriples, but you may want to experiment and see.

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941  |  o: 925.264.4729 | @: rve...@yarcdata.com  |  Skype:
rvesse
6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588






On 6/13/12 9:17 AM, Rob Vesse rve...@cray.com wrote:

Hi Stefan

I think the main problem here is one of error recovery.  When I see
invalid data at either the tokenizer or parser level, what do I actually do
with it?  I.e. where do I skip forward to in order to ignore that invalid
triple?

For N-Triples, which is officially a line-based format, the fix would likely
be to skip to the end of the line on a tokenizing error. On a parsing error,
skip to the next `.` token, since if we hit the error in parsing (not
tokenization) we can assume the tokens are valid syntactically but not
semantically, e.g. a blank node in the predicate position.  For other
formats, sensible error recovery may be much harder or impossible.

It's probably not that hard to write an N-Triples tokenizer and parser that
does error recovery based on the existing ones; patches are always
welcome. If I ever have some spare time I might look at this myself.
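The line-skipping recovery Rob describes can be sketched in isolation. In the class below, `parseLine` is a hypothetical stand-in for a real N-Triples parser (it only checks the coarse shape of a triple); the point is the recovery loop, which simply drops a bad line and continues:

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveringNTriplesReader {
    /** Try to parse each physical line independently; on failure, recovery is
     *  simply "skip to the next newline", which a line-based format permits.
     *  Bad lines are collected so they can be logged or fixed later. */
    public static List<String[]> readAll(String input, List<String> badLines) {
        List<String[]> triples = new ArrayList<>();
        for (String line : input.split("\n")) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) continue;
            try {
                triples.add(parseLine(t));
            } catch (IllegalArgumentException e) {
                badLines.add(line); // error recovery: drop the line, keep going
            }
        }
        return triples;
    }

    // Hypothetical minimal parser: expects three terms and a terminating dot.
    static String[] parseLine(String line) {
        if (!line.endsWith(".")) throw new IllegalArgumentException("no terminating dot");
        String[] terms = line.substring(0, line.length() - 1).trim().split("\\s+", 3);
        if (terms.length != 3) throw new IllegalArgumentException("expected 3 terms");
        if (!terms[0].startsWith("<") && !terms[0].startsWith("_:"))
            throw new IllegalArgumentException("bad subject");
        if (!terms[1].startsWith("<"))
            throw new IllegalArgumentException("bad predicate: " + terms[1]);
        return terms;
    }
}
```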

Rob




On 6/13/12 7:13 AM, Stefan Scheffler sscheff...@avantgarde-labs.de
wrote:


On 13.06.2012 15:55, Andy Seaborne wrote:

Hi Andy,
At the moment I just want to skip the invalid triples (later they should
be stored and maybe fixed, if possible).
The main goal is to have an import process which runs automatically and
doesn't stop at every failure.
The moment of checking doesn't matter (at the moment ;)). It can be before
or during the import (I used the second strategy with Sesame).

Thanks, Stefan











Re: Importing N-Triples into TDB without stopping at an error

2012-06-13 Thread Andy Seaborne

On 13/06/12 17:52, Rob Vesse wrote:

Actually it looks like some of this stuff is already in place. If you
take a look at LangNTriples in ARQ you will see it derives from
LangNTuples, which has a setSkipOnBadTerms() method. I can't tell
whether this actually affects anything, i.e. whether it is actually honoured
by LangNTriples, but you may want to experiment and see.


There are two ways I can see of doing it:

1/ The tokenizer itself could be modified and taught to skip at the
character level (below tokens) to find a real newline, so that aspect is
easy.  The tokenizer needs upgrading without slowing it down - tuning
the tokenizer is quite important for overall performance.


2/ If the emphasis is on error recovery, I'd experiment with reading
in two stages - reading into the large buffer the I/O uses, then reading
out a line, then parsing the line for a triple.  Error recovery is then to
throw away the working line if it can't be parsed.


No real tokenizer changes, but it does an extra copy to extract the line;
that copy may not make much difference, as the data for the line is in the
CPU cache and fast to access straight after it was extracted.


(From playing with bytes-to-UTF-8 decoding, I know an extra copy can be
faster - the Java libraries do better for large blocks than a UTF-8 decoder
I wrote, and they need an extra copy; presumably the authors know exactly
what works and what doesn't in Java, even if it's not in native code.)


For Turtle, it's harder - skipping to DOT-newline is probably OK, based
on the fact that typical usage does not put multiple blocks of triples
on one line (yes, it happens, but not much at scale).
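The DOT-newline heuristic for Turtle might look like the sketch below. `skipToDotNewline` is a hypothetical helper, not part of RIOT: after a parse error it scans forward for the first `.` that is followed (modulo trailing spaces) by end-of-line, ignoring dots inside quoted literals, and returns the position where parsing should resume:

```java
public class TurtleSkipRecovery {
    /** After a parse error at position pos, skip forward to the first '.'
     *  immediately followed (apart from spaces/tabs) by end-of-line, and
     *  return the index just past that newline (or the input length).
     *  A '.' inside a quoted literal is ignored via a simple quote toggle. */
    public static int skipToDotNewline(String input, int pos) {
        boolean inString = false;
        for (int i = pos; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"' && (i == 0 || input.charAt(i - 1) != '\\'))
                inString = !inString; // enter/leave a quoted literal
            if (!inString && c == '.') {
                int j = i + 1;
                while (j < input.length()
                        && (input.charAt(j) == ' ' || input.charAt(j) == '\t')) j++;
                if (j >= input.length() || input.charAt(j) == '\n')
                    return Math.min(j + 1, input.length()); // resume after newline
            }
        }
        return input.length(); // no recovery point: give up on the rest
    }
}
```

As Andy notes, this only works because a statement-ending dot at end-of-line is the overwhelmingly common layout; a dot mid-line (multiple triple blocks on one line) would be skipped past.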


Andy


