Hi Stefan,

I think the main problem here is one of error recovery. When I see invalid data at either the tokenizer or the parser level, what do I actually do with it? I.e. where do I skip forward to in order to ignore that invalid triple?

For NTriples, which is officially a line-based format, the fix would likely be to skip to the end of the line if we hit an error while tokenizing, and to skip to the next `.` token if we hit an error while parsing, since an error at the parsing stage (not tokenization) means we can assume the tokens are syntactically valid but not semantically valid, e.g. a blank node in the predicate position. If we were talking about other formats, sensible error recovery may be much harder or impossible.
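In the meantime you can exploit the line-based nature of the format and pre-filter the file yourself: parse each line as a one-triple NTriples document, keep the lines that parse, set the rest aside, and pipe the survivors into TDB - essentially the "parser that kept going and spat out working n-triples" Damian suggested. Here's a rough sketch using Jena's RIOT API; note that the org.apache.jena.riot package names and the RDFParser builder below are from current Jena, so treat the exact class names as assumptions and adjust for whatever version you're on:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFParser;
import org.apache.jena.riot.RiotException;
import org.apache.jena.riot.system.StreamRDFLib;

/**
 * Copies the valid lines of an N-Triples file to a new file, skipping
 * lines that fail to parse. Relies on N-Triples being line-based, so
 * "skip to the end of the line" is the entire recovery strategy.
 */
public class NTriplesFilter {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter good = new PrintWriter(new FileWriter(args[1]));
             PrintWriter bad = new PrintWriter(new FileWriter(args[1] + ".bad"))) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    good.println(line);  // keep blank lines and comments as-is
                    continue;
                }
                try {
                    // Parse the single line as a one-triple N-Triples document;
                    // a RiotException here means the line is invalid.
                    RDFParser.create()
                             .fromString(line)
                             .lang(Lang.NTRIPLES)
                             .build()
                             .parse(StreamRDFLib.sinkNull());
                    good.println(line);
                } catch (RiotException e) {
                    bad.println(line);   // set aside for later inspection/fixing
                    System.err.println("Skipping line " + lineNo + ": " + e.getMessage());
                }
            }
        }
    }
}
```

Run it as e.g. `java NTriplesFilter dbpedia.nt dbpedia-clean.nt`, load the clean file with tdbloader, and the .bad file gives you the invalid lines to store and maybe fix later, as you wanted.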
It's probably not that hard to write an NTriples tokenizer and parser that do error recovery based on the existing ones; patches are always welcome. If I ever have some spare time I might look at this myself.

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941 | o: 925.264.4729 | @: [email protected] | Skype: rvesse
6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588

On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]> wrote:

>
>On 13.06.2012 15:55, Andy Seaborne wrote:
>> On 13/06/12 14:19, Damian Steer wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>> Hello, I need to import large N-Triples files (DBpedia) into a TDB.
>>>> The problem is that many of the triples are not valid (things like a
>>>> missing '<' or invalid chars), leading to an exception which
>>>> quits the import... I just want to skip them and continue, so that
>>>> all valid triples are in the TDB at the end.
>>>>
>>>> Is there a possibility to do that easily? I tried to rewrite
>>>> ARQ, but this is very complex. With friendly regards, Stefan
>>>> Scheffler
>>>>
>>>
>>> You'd be much better off finding an N-Triples parser that kept going
>>> and also spat out (working) n-triples for piping to TDB. I can't see
>>> an option like that in the riot command line.
>>
>> There isn't such an option - there could be (if someone wants to
>> contribute a patch).
>>
>> This is a typical ETL situation - you're going to have to clean those
>> triples (which were presumably not written by an RDF tool). Do you
>> want to lose them or fix them?
>>
>> Checking before loading is always a good idea, especially for data from
>> outside and from other tools. When I receive TTL or RDF/XML, I parse to
>> NT, which means it's then checked. Then I load the data.
>>
>> Andy
>>
>
>Hi Andy,
>At the moment I just want to skip the invalid triples (later they should
>be stored and maybe fixed, if that's possible).
>The main goal is to have an import process which runs automatically and
>doesn't stop on every failure it finds.
>The moment of checking doesn't matter (atm ;)). It can be before or
>during the import (I used the second strategy with Sesame).
>
>Thanks, Stefan
>
>--
>Stefan Scheffler
>Avantgarde Labs GbR
>Löbauer Straße 19, 01099 Dresden
>Phone: + 49 (0) 351 21590834
>Email: [email protected]
>
