Hi Stefan

I think the main problem here is one of error recovery.  When I see
invalid data at either the tokenizer or the parser level, what do I
actually do with it?  I.e. where do I skip forward to in order to ignore
that invalid triple?

For NTriples, which is officially a line-based format, the fix would likely
be to skip to the end of the line on an error in tokenizing, and to skip to
the next `.` token on an error in parsing.  If we hit the error in parsing
(not tokenization) then we can assume the tokens are valid syntactically
but not semantically, e.g. a blank node in the predicate position.  If we
were talking about other formats, sensible error recovery may be much
harder or impossible.
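
To make the two recovery points concrete, here is a rough sketch in Python.
It is not the RIOT tokenizer or parser and does not use any Jena API; the
names (`tokenize`, `parse_with_recovery`, `TokenError`) are made up for
illustration, and the token patterns cover only a simplified subset of the
NTriples grammar.

```python
import re

# Simplified token patterns (an assumption, not the full NTriples grammar).
TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<iri><[^<>\s"]*>)
      | (?P<bnode>_:[A-Za-z0-9]+)
      | (?P<literal>"(?:[^"\\]|\\.)*"(?:@[A-Za-z-]+|\^\^<[^<>\s"]*>)?)
      | (?P<dot>\.)
    )''', re.VERBOSE)


class TokenError(Exception):
    """Raised when a line contains text that is not a recognisable token."""


def tokenize(line):
    """Split one line into (kind, text) tokens, or raise TokenError."""
    line = line.rstrip()
    tokens, pos = [], 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise TokenError(f"bad token at column {pos + 1}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def check_statement(statement):
    """Return an error message for a semantically invalid triple, else None."""
    if len(statement) != 3:
        return f"expected 3 terms, found {len(statement)}"
    kinds = [kind for kind, _ in statement]
    if kinds[0] not in ("iri", "bnode"):
        return "subject must be an IRI or blank node"
    if kinds[1] != "iri":
        return "predicate must be an IRI"
    return None


def parse_with_recovery(lines):
    """Collect valid triples; record (line number, reason) for skipped input."""
    triples, skipped = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        try:
            tokens = tokenize(line)
        except TokenError as error:
            # Tokenizer-level recovery: give up on the rest of this line.
            skipped.append((number, str(error)))
            continue
        statement = []
        for kind, text in tokens:
            if kind != "dot":
                statement.append((kind, text))
                continue
            # Parser-level recovery: the '.' token ends the statement, so a
            # bad statement is dropped and parsing resumes after the '.'.
            problem = check_statement(statement)
            if problem:
                skipped.append((number, problem))
            else:
                triples.append(tuple(text for _, text in statement))
            statement = []
        if statement:
            skipped.append((number, "missing terminating '.'"))
    return triples, skipped


sample = [
    '<http://example.org/s> <http://example.org/p> "ok" .',
    'http://example.org/s> <http://example.org/p> "missing bracket" .',
    '<http://example.org/s> _:b1 <http://example.org/o> .',
    '_:b0 <http://example.org/p> <http://example.org/o> .',
]
triples, skipped = parse_with_recovery(sample)
```

With the sample above, the second line is dropped by the tokenizer (missing
`<`) and the third by the parser (blank node as predicate); the other two
end up in `triples`.  The `skipped` list keeps the line number and reason so
the bad input could be stored and fixed later.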

It's probably not that hard to write an NTriples tokenizer and parser that
do error recovery, based on the existing ones. Patches are always
welcome. If I ever have some spare time I might look at this myself.

Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941  |  o: 925.264.4729 | @: [email protected]  |  Skype:
rvesse
6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588



On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]>
wrote:

>
>Am 13.06.2012 15:55, schrieb Andy Seaborne:
>> On 13/06/12 14:19, Damian Steer wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>> Hello, I need to import large n-triple files (dbpedia) into a tdb.
>>>> The problem is, that many of the triples are not valid (like
>>>> missing '<' or invalid chars) and leading to an exception which
>>>> quits the import... I just want to skip them and continue, so that
>>>> all valid triples are in the tdb at the end.
>>>>
>>>> Is there a possibility to do that easily? I tried to rewrite the
>>>> ARQ, but this is very complex.
>>>>
>>>> With friendly regards
>>>> Stefan Scheffler
>>>>
>>>
>>> You'd be much better off finding an n-triple parser that kept going
>>> and also spat out (working) n-triples for piping to TDB. I can't see
>>> an option like that in the riot command line.
>>
>> There isn't such an option - there could be (if someone wants to
>> contribute a patch).
>>
>> This is a typical ETL situation - you're going to have to clean those
>> triples (which were not written by an RDF tool presumably).  Do you
>> want to lose them or fix them?
>>
>> Checking before loading is always a good idea, especially data from
>> outside and other tools.  When I receive TTL or RDF/XML, I parse to NT
>> which means it's then checked.  Then load the data.
>>
>>     Andy
>>
>
>   Hi Andy,
>At the moment I just want to skip the invalid triples (later they should
>be stored and maybe fixed, if it's possible).
>The main goal is to have an import process which runs automatically and
>doesn't stop on every failure it finds.
>The moment of checking doesn't matter (atm ;)). It can be before or
>during the import (but I used the second strategy on Sesame).
>
>Thanks Stefan
>
>-- 
>Stefan Scheffler
>Avantgarde Labs GbR
>Löbauer Straße 19, 01099 Dresden
>Telefon: + 49 (0) 351 21590834
>Email: [email protected]
>
>
>
>
>
>
