Actually, it looks like some of this is already in place. If you take a look at LangNTriples in ARQ you will see it derives from LangNTuples, which has a setSkipOnBadTerms() method. I can't tell whether that setting is actually honored by LangNTriples, so you may want to experiment and see.
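
If that setting turns out not to help, the line-based recovery I described in my earlier message (quoted below) can be approximated outside of ARQ as a pre-filter. What follows is only a rough sketch under stated assumptions, not how RIOT does it: the class name NTriplesLineFilter is made up for illustration, and the regular expressions are a deliberately simplified approximation of the N-Triples grammar (no trailing comments, no strict IRI or escape checking). It treats each line as one triple, passes lines that look well formed, and skips the rest, which is the "skip to the end of the line" strategy.

```java
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical pre-filter for N-Triples; not part of ARQ/RIOT. */
final class NTriplesLineFilter {
    // Simplified term patterns (assumption: good enough to catch gross
    // errors such as a missing '<'; this is NOT the full N-Triples grammar).
    private static final String IRI = "<[^<>\"\\s]*>";
    private static final String BNODE = "_:[A-Za-z0-9]+";
    private static final String LITERAL =
        "\"(?:[^\"\\\\]|\\\\.)*\"(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\\^\\^" + IRI + ")?";

    // subject: IRI or blank node; predicate: IRI only; object: any term.
    private static final Pattern TRIPLE = Pattern.compile(
        "\\s*(?:" + IRI + "|" + BNODE + ")"
        + "\\s+" + IRI
        + "\\s+(?:" + IRI + "|" + BNODE + "|" + LITERAL + ")"
        + "\\s*\\.\\s*");

    /** True if the whole line looks like a single well-formed triple. */
    static boolean isValidTriple(String line) {
        return TRIPLE.matcher(line).matches();
    }

    /** Blank lines and comment lines are legal but carry no triple. */
    static boolean isBlankOrComment(String line) {
        String t = line.trim();
        return t.isEmpty() || t.startsWith("#");
    }

    /** Reads stdin; valid triples go to stdout, rejected lines to stderr. */
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(System.in, StandardCharsets.UTF_8));
        PrintStream out = new PrintStream(
            new BufferedOutputStream(System.out), false, "UTF-8");
        long lineNo = 0;
        long skipped = 0;
        String line;
        while ((line = in.readLine()) != null) {
            lineNo++;
            if (isBlankOrComment(line)) {
                continue;
            }
            if (isValidTriple(line)) {
                out.println(line);
            } else {
                // Error recovery: drop the whole line and carry on.
                skipped++;
                System.err.println("Skipping line " + lineNo + ": " + line);
            }
        }
        out.flush();
        System.err.println("Skipped " + skipped + " invalid line(s)");
    }
}
```

After compiling, something like `java NTriplesLineFilter < dump.nt > clean.nt 2> rejected.log` (file names are placeholders) would give you a cleaned file to load into TDB and a log of the rejected lines to fix later. Because the check is stricter in some places and looser in others than a real parser, it is still worth running the cleaned output through riot before loading.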

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941 | o: 925.264.4729 | @: [email protected] | Skype: rvesse
6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588

On 6/13/12 9:17 AM, "Rob Vesse" <[email protected]> wrote:

>Hi Stefan
>
>I think the main problem here is one of error recovery. When I see
>invalid data at either the tokenizer or the parser level, what do I
>actually do with it? i.e. where do I skip forward to in order to ignore
>that invalid triple?
>
>For NTriples, which is officially a line-based format, the fix would
>likely be to skip to the end of the line if hitting an error in
>tokenizing, and if parsing, to skip to the next `.` token, since we'll
>know that if we hit the error in parsing (not tokenization) then we can
>assume the tokens are valid syntactically but not semantically, e.g. a
>blank node in the predicate position. If we were talking about other
>formats, sensible error recovery may be much harder/impossible.
>
>It's probably not that hard to write an NTriples tokenizer and parser
>that does error recovery based off of the existing ones; patches are
>always welcome. If I ever have some spare time I might look at this
>myself.
>
>Rob
>
>On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]>
>wrote:
>
>>
>>On 13.06.2012 15:55, Andy Seaborne wrote:
>>> On 13/06/12 14:19, Damian Steer wrote:
>>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>>> Hello, I need to import large n-triple files (dbpedia) into a tdb.
>>>>> The problem is that many of the triples are not valid (like a
>>>>> missing '<' or invalid chars), leading to an exception which
>>>>> quits the import... I just want to skip them and continue, so that
>>>>> all valid triples are in the tdb at the end.
>>>>>
>>>>> Is there a possibility to do that easily? I tried to rewrite the
>>>>> ARQ, but this is very complex.
>>>>>
>>>>> With friendly regards
>>>>> Stefan Scheffler
>>>>>
>>>>
>>>> You'd be much better off finding an n-triple parser that kept going
>>>> and also spat out (working) n-triples for piping to TDB. I can't see
>>>> an option like that in the riot command line.
>>>
>>> There isn't such an option - there could be (if someone wants to
>>> contribute a patch).
>>>
>>> This is a typical ETL situation - you're going to have to clean those
>>> triples (which were not written by an RDF tool, presumably). Do you
>>> want to lose them or fix them?
>>>
>>> Checking before loading is always a good idea, especially for data
>>> from outside and other tools. When I receive TTL or RDF/XML, I parse
>>> to NT, which means it's then checked. Then load the data.
>>>
>>> Andy
>>>
>>
>> Hi Andy,
>>At the moment I just want to skip the invalid triples (later they should
>>be stored and maybe fixed, if it's possible).
>>The main goal is to have an import process which runs automatically and
>>doesn't stop on every failure found.
>>The moment of checking doesn't matter (atm ;)). It can be before or
>>during the import (but I used the second strategy on sesame).
>>
>>Thanks
>>Stefan
>>
>>--
>>Stefan Scheffler
>>Avantgarde Labs GbR
>>Löbauer Straße 19, 01099 Dresden
>>Phone: + 49 (0) 351 21590834
>>Email: [email protected]
>>
