Importing N-Triples into TDB without stopping at an error
Hello,

I need to import large N-Triples files (DBpedia) into a TDB store. The problem is that many of the triples are not valid (e.g. missing quotes or invalid characters), which leads to an exception that aborts the import. I just want to skip the bad triples and continue, so that all valid triples end up in the TDB store. Is there an easy way to do that? I tried to modify ARQ, but that is very complex.

Kind regards,
Stefan Scheffler

--
Stefan Scheffler
Avantgarde Labs GbR
Löbauer Straße 19, 01099 Dresden
Telefon: +49 (0) 351 21590834
Email: sscheff...@avantgarde-labs.de
Re: Importing N-Triples into TDB without stopping at an error
On 13/06/12 14:03, Stefan Scheffler wrote:
> I need to import large N-Triples files (DBpedia) into a TDB store. Many of the triples are not valid and lead to an exception that aborts the import. I just want to skip them and continue. Is there a possibility to do that easily?

You'd be much better off finding an N-Triples parser that kept going and also spat out (working) N-Triples for piping to TDB. I can't see an option like that in the riot command line.

Damian
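One way to get such a keep-going filter is a small standalone script that emits only the lines that look like valid triples, for piping into the TDB loader. A rough sketch follows; note that the regex below only approximates the N-Triples grammar, so it will let some bad terms through and may reject some exotic but valid ones:

```python
import re
import sys

# Loose approximations of N-Triples terms -- NOT the full grammar.
IRI = r'<[^<>"{}|^`\\\x00-\x20]*>'
BNODE = r'_:[A-Za-z][A-Za-z0-9]*'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'

# subject predicate object '.' -- one triple per line.
TRIPLE = re.compile(
    r'^\s*(%s|%s)\s+(%s)\s+(%s|%s|%s)\s*\.\s*$'
    % (IRI, BNODE, IRI, IRI, BNODE, LITERAL))

def filter_ntriples(lines):
    """Yield lines that look like valid triples; count and drop the rest."""
    skipped = 0
    for line in lines:
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # blank lines and comments carry no triples
        if TRIPLE.match(line):
            yield line
        else:
            skipped += 1
    print("skipped %d bad lines" % skipped, file=sys.stderr)

if __name__ == '__main__':
    for ok in filter_ntriples(sys.stdin):
        sys.stdout.write(ok)
```

Cleaned output can then be written to a file and loaded, e.g. `python ntfilter.py < dbpedia.nt > clean.nt && tdbloader --loc=DB clean.nt` (the `ntfilter.py` name is just for illustration).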
Re: Importing N-Triples into TDB without stopping at an error
On 13/06/12 14:19, Damian Steer wrote:
> You'd be much better off finding an N-Triples parser that kept going and also spat out (working) N-Triples for piping to TDB. I can't see an option like that in the riot command line.

There isn't such an option - there could be (if someone wants to contribute a patch).

This is a typical ETL situation - you're going to have to clean those triples (which presumably were not written by an RDF tool). Do you want to lose them or fix them?

Checking before loading is always a good idea, especially for data from outside and from other tools. When I receive TTL or RDF/XML, I parse it to NT, which means it's then checked. Then load the data.

Andy
Re: Importing N-Triples into TDB without stopping at an error
Actually it looks like some of this is already in place. If you take a look at LangNTriples in ARQ you will see it derives from LangNTuples, which has a setSkipOnBadTerms() method, but I can't tell whether this actually affects anything, i.e. whether it is actually honoured by LangNTriples - you may want to experiment and see.

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941 | o: 925.264.4729 | @: rve...@yarcdata.com | Skype: rvesse
6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588

On 6/13/12 9:17 AM, Rob Vesse rve...@cray.com wrote:
> Hi Stefan
>
> I think the main problem here is one of error recovery. When I see invalid data at the tokenizer/parser level, what do I actually do with it? I.e. where do I skip forward to in order to ignore that invalid triple? For N-Triples, which is officially a line-based format, the fix would likely be to skip to the end of the line if hitting an error while tokenizing, and if parsing, to skip to the next `.` token - since if we hit the error in parsing (not tokenization) we can assume the tokens are valid syntactically but not semantically, e.g. a blank node in the predicate position. If we were talking about other formats, sensible error recovery might be much harder or impossible.
>
> It's probably not that hard to write an N-Triples tokenizer and parser that does error recovery, based off the existing ones; patches are always welcome. If I ever have some spare time I might look at this myself.
> On 6/13/12 7:13 AM, Stefan Scheffler sscheff...@avantgarde-labs.de wrote:
>
>> Hi Andy,
>>
>> At the moment I just want to skip the invalid triples (later they should be stored and maybe fixed, if possible). The main goal is to have an import process which runs automatically and doesn't stop at every failure it finds. The moment of checking doesn't matter (at the moment ;)) - it can be before or during the import (I used the second strategy with Sesame).
>>
>> Thanks,
>> Stefan
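Rob's recovery strategy - on a tokenizer error, skip to the end of the line; on a parser error, skip past the terminating `.` - can be illustrated with a toy tokenizer and parser. This is hypothetical code for illustration only, not the ARQ implementation (and since N-Triples puts one triple per line, both recovery points coincide with the end of the line here):

```python
def tokenize(line):
    """Toy tokenizer: split a line into term tokens plus the final '.'.
    Raises ValueError on anything it cannot recognise as a token."""
    tokens = []
    for part in line.split():
        if part == '.' or part.startswith(('<', '_:', '"')):
            tokens.append(part)
        else:
            raise ValueError("bad token: " + part)
    return tokens

def parse(tokens):
    """Toy parser: the tokens are well-formed, now check their *roles*,
    e.g. reject a blank node in the predicate position."""
    if len(tokens) != 4 or tokens[3] != '.':
        raise ValueError("not a triple")
    s, p, o = tokens[:3]
    if not p.startswith('<'):
        raise ValueError("predicate must be an IRI")
    return (s, p, o)

def parse_with_recovery(lines):
    """Collect good triples; on any error, record it and move to the next line."""
    triples, errors = [], []
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue
        try:
            tokens = tokenize(line)        # tokenizer error -> whole line is lost
            triples.append(parse(tokens))  # parser error -> skip to the '.'
        except ValueError as e:
            errors.append((n, str(e)))     # record and carry on
    return triples, errors
```

The split between `tokenize` and `parse` mirrors the distinction Rob draws: a tokenizer error means the characters are unreadable, while a parser error means the tokens are readable but in the wrong positions.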
Re: Importing N-Triples into TDB without stopping at an error
On 13/06/12 17:52, Rob Vesse wrote:
> Actually it looks like some of this is already in place. If you take a look at LangNTriples in ARQ you will see it derives from LangNTuples, which has a setSkipOnBadTerms() method, but I can't tell whether this is actually honoured by LangNTriples.

There are two ways I can see of doing it:

1/ The tokenizer itself could be modified and taught to skip at the character level (below tokens) to find a real newline, so that aspect is easy. The tokenizer needs upgrading without slowing it down - tuning the tokenizer is quite important for overall performance.

2/ If the emphasis is on error recovery, I'd experiment with reading in two stages - reading into the large buffer the I/O uses, then reading out a line, then parsing the line for a triple. Error recovery is then to throw away the working line if it can't be parsed. There are no real tokenizer changes, but it does an extra copy to extract the line; that copy may not make much difference, as the data for the line is in the CPU cache and is fast to access straight after it was extracted. (From playing with bytes-to-UTF-8 decoding, I know an extra copy can be faster - the Java libraries do better for large blocks than a UTF-8 decoder I wrote, and they need an extra copy; presumably the authors know exactly what works and what doesn't in Java, even if it isn't down in native code.)

For Turtle, it's harder - skipping to DOT-newline is probably OK, based on the fact that typical usage does not put multiple blocks of triples on one line (yes, it happens, but not much at scale).

Andy
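Andy's second option - let a large buffer absorb the raw I/O, pull out one line at a time, try to parse that line as a triple, and throw the line away on failure - might be sketched like this (a hypothetical illustration with a deliberately trivial "parser"; the real ARQ tokenizer does far more):

```python
import io

def lines_from_buffer(stream, buffer_size=1 << 20):
    """Stage 1: a large buffer absorbs the raw I/O, then hands out lines."""
    buffered = io.BufferedReader(stream, buffer_size=buffer_size)
    for raw in buffered:
        yield raw.decode('utf-8', errors='replace')

def parse_triple(line):
    """Stage 2: parse one working line; raise ValueError if it is not a
    triple. (Trivial check: three whitespace-separated terms plus '.')"""
    parts = line.split()
    if len(parts) != 4 or parts[3] != '.':
        raise ValueError("not a triple: " + line.rstrip())
    return tuple(parts[:3])

def load(stream, sink, on_error=None):
    """Feed good triples to `sink`; discard unparseable lines and continue."""
    for line in lines_from_buffer(stream):
        if not line.strip():
            continue
        try:
            sink(parse_triple(line))  # good line -> pass the triple on
        except ValueError as e:
            if on_error:
                on_error(e)           # bad line -> throw it away, carry on
```

The extra copy Andy mentions is the line extraction in stage 1; recovery costs nothing beyond dropping the working line, exactly because the error can never escape the line boundary.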