Yes, I think it is definitely tricky. I would suggest a separate tokenizer implementation rather than trying to bolt this functionality onto the existing one; that way there is scope for adding more complex error recovery at a later date, and current performance is unaffected.
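
Something like the untested sketch below would give you the line-at-a-time recovery discussed further down this thread: read a line, try to parse it in isolation, and only emit it if it parses. N-Triples being strictly line-based is what makes this safe. It's only a sketch - it assumes the current Jena 2.x package layout, and creating a throwaway Model per line is slow, but for a one-off cleaning pass in front of tdbloader that may be acceptable:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.StringReader;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

/**
 * Copies N-Triples from stdin to stdout, dropping any line that does
 * not parse. Relies on N-Triples being line-based, so each line can
 * be parsed in isolation.
 */
public class NTriplesFilter {
    public static void main(String[] args) throws Exception {
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        long skipped = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                System.out.println(line);   // pass blank lines and comments through
                continue;
            }
            try {
                // Throwaway model per line; N-Triples has no cross-line state.
                Model m = ModelFactory.createDefaultModel();
                m.read(new StringReader(line), null, "N-TRIPLE");
                System.out.println(line);   // parsed OK - keep it
            } catch (Exception e) {
                skipped++;                  // recovery = discard the bad line
            }
        }
        System.err.println("Skipped " + skipped + " bad line(s)");
    }
}

You would run it as "java NTriplesFilter < dirty.nt > clean.nt" and then load clean.nt as normal - essentially the filter-and-pipe approach Damian suggests further down.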
Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941 | o: 925.264.4729 | @: [email protected] | Skype: rvesse
6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588

On 6/13/12 12:52 PM, "Andy Seaborne" <[email protected]> wrote:

>On 13/06/12 17:52, Rob Vesse wrote:
>> Actually it looks like some of this stuff is already in place. If you
>> take a look at LangNTriples in ARQ you will see it derives from
>> LangNTuples, which has a setSkipOnBadTerms() method, but I can't tell
>> whether this actually affects anything, i.e. whether it is actually
>> honored by LangNTriples - you may want to experiment and see.
>
>There are two ways I can see of doing it:
>
>1/ The tokenizer itself could be modified and taught to skip at the
>character level (below tokens) to find a real newline, so that aspect is
>easy. The tokenizer needs upgrading without slowing it down, though -
>tuning the tokenizer is quite important for overall performance.
>
>2/ If the emphasis is on error recovery, I'd experiment with reading in
>two stages - reading into the large buffer the I/O uses, then reading
>out a line, then parsing the line for a triple. Error recovery is then:
>throw away the working line if it can't be parsed.
>
>No real tokenizer changes, but it does an extra copy to extract the
>line; that copy may not make much difference, as the data for the line
>is in CPU cache and is fast to access straight after it was extracted.
>
>(From playing with bytes-to-UTF-8 conversion, I know an extra copy can
>be faster - the Java libraries do better for large blocks than a UTF-8
>decoder I wrote, and they need an extra copy - presumably the authors
>know exactly what works and what doesn't in Java, even if it's not done
>in native code.)
>
>For Turtle it's harder - skipping to DOT-newline is probably OK, based
>on the fact that typical usage is not to have multiple blocks of
>triples on one line (yes, it happens, but not much at scale).
>
> Andy
>
>
>> On 6/13/12 9:17 AM, "Rob Vesse" <[email protected]> wrote:
>>
>>> Hi Stefan
>>>
>>> I think the main problem here is one of error recovery. When I see
>>> invalid data at either the tokenizer or the parser level, what do I
>>> actually do with it? I.e. where do I skip forward to in order to
>>> ignore that invalid triple?
>>>
>>> For NTriples, which is officially a line-based format, the fix would
>>> likely be to skip to the end of the line on a tokenizing error, and
>>> on a parsing error to skip to the next `.` token, since if we hit the
>>> error in parsing (not tokenization) we can assume the tokens are
>>> syntactically valid but not semantically valid, e.g. a blank node in
>>> the predicate position. For other formats, sensible error recovery
>>> may be much harder or impossible.
>>>
>>> It's probably not that hard to write an NTriples tokenizer and parser
>>> that does error recovery based on the existing ones; patches are
>>> always welcome. If I ever have some spare time I might look at this
>>> myself.
>>>
>>> Rob
>>>
>>> Rob Vesse -- YarcData.com -- A Division of Cray Inc
>>> Software Engineer, Bay Area
>>> m: 925.960.3941 | o: 925.264.4729 | @: [email protected] | Skype:
>>> rvesse
>>> 6210 Stoneridge Mall Rd | Suite 120 | Pleasanton CA, 94588
>>>
>>>
>>> On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]>
>>> wrote:
>>>
>>>> On 13.06.2012 15:55, Andy Seaborne wrote:
>>>>> On 13/06/12 14:19, Damian Steer wrote:
>>>>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>>>>> Hello, I need to import large N-Triples files (DBpedia) into a
>>>>>>> TDB. The problem is that many of the triples are not valid (e.g.
>>>>>>> missing '<' or invalid characters), leading to an exception which
>>>>>>> quits the import... I just want to skip them and continue, so
>>>>>>> that all valid triples are in the TDB at the end.
>>>>>>>
>>>>>>> Is there a possibility to do that easily? I tried to rewrite ARQ,
>>>>>>> but this is very complex. With friendly regards, Stefan Scheffler
>>>>>>>
>>>>>> You'd be much better off finding an N-Triples parser that kept
>>>>>> going and also spat out (working) N-Triples for piping to TDB. I
>>>>>> can't see an option like that in the riot command line.
>>>>>
>>>>> There isn't such an option - there could be (if someone wants to
>>>>> contribute a patch).
>>>>>
>>>>> This is a typical ETL situation - you're going to have to clean
>>>>> those triples (which were presumably not written by an RDF tool).
>>>>> Do you want to lose them or fix them?
>>>>>
>>>>> Checking before loading is always a good idea, especially for data
>>>>> from outside and from other tools. When I receive TTL or RDF/XML, I
>>>>> parse to NT, which means it's then checked. Then I load the data.
>>>>>
>>>>> Andy
>>>>>
>>>>
>>>> Hi Andy,
>>>> At the moment I just want to skip the invalid triples (later they
>>>> should be stored and maybe fixed, if possible).
>>>> The main goal is to have an import process which runs automatically
>>>> and doesn't stop on every failure it finds.
>>>> The moment of checking doesn't matter (atm ;)). It can be before or
>>>> during the import (but I used the second strategy with Sesame).
>>>>
>>>> Thanks, Stefan
>>>>
>>>> --
>>>> Stefan Scheffler
>>>> Avantgarde Labs GbR
>>>> Löbauer Straße 19, 01099 Dresden
>>>> Telefon: +49 (0) 351 21590834
>>>> Email: [email protected]
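
PS Re the recovery strategy in my mail quoted above: at the parser level
(tokens syntactically fine, triple malformed, e.g. a blank node in the
predicate position) the recovery rule is simply "discard tokens up to and
including the next DOT, then resume". The types below are hypothetical
stand-ins, not the actual ARQ tokenizer API, but they show the shape of it:

// Hypothetical stand-ins for illustration only - not the real ARQ types.
interface Token     { boolean isDot(); }
interface Tokenizer { boolean hasNext(); Token next(); }

class Recovery {
    // On a parse error, consume tokens up to and including the next DOT;
    // the token after that starts a fresh triple.
    static void skipToNextDot(Tokenizer tokens) {
        while (tokens.hasNext()) {
            if (tokens.next().isDot())
                return;
        }
    }
}

For Turtle the same skip is only approximate, for the reason Andy gives
above: a DOT can close a block of triples that shares a line with another
block.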
