Re: importing ntriples into tdb without stop at an error

Rob Vesse Wed, 13 Jun 2012 09:53:06 -0700

Actually, it looks like some of this is already in place.  If you take a
look at LangNTriples in ARQ, you will see that it derives from LangNTuples,
which has a setSkipOnBadTerms() method.  I can't tell whether LangNTriples
actually honors it, so you may want to experiment and see.


Rob

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941  |  o: 925.264.4729 | @: [email protected]  |  Skype:
rvesse
6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588






On 6/13/12 9:17 AM, "Rob Vesse" <[email protected]> wrote:

>Hi Stefan
>
>I think the main problem here is one of error recovery.  When I see
>invalid data at either the tokenizer or the parser level, what do I
>actually do with it?  I.e. where do I skip forward to in order to ignore
>that invalid triple?
>
>For NTriples, which is officially a line-based format, the fix would
>likely be to skip to the end of the line on a tokenizer error, and to skip
>to the next `.` token on a parser error.  If we hit the error in parsing
>(not tokenization), we can assume the tokens are valid syntactically but
>not semantically, e.g. a blank node in the predicate position.  For other
>formats, sensible error recovery may be much harder or impossible.
>
>It's probably not that hard to write an NTriples tokenizer and parser that
>do error recovery, based on the existing ones; patches are always
>welcome.  If I ever have some spare time I might look at this myself.
>
>Rob
>
>
>
>
>On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]>
>wrote:
>
>>
>>On 13.06.2012 15:55, Andy Seaborne wrote:
>>> On 13/06/12 14:19, Damian Steer wrote:
>>>>
>>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>>> Hello, I need to import large n-triple files (dbpedia) into TDB.
>>>>> The problem is that many of the triples are not valid (e.g. a
>>>>> missing '<' or invalid chars), which leads to an exception that
>>>>> aborts the import... I just want to skip them and continue, so that
>>>>> all valid triples are in TDB at the end.
>>>>>
>>>>> Is there an easy way to do that? I tried to rewrite ARQ, but this
>>>>> is very complex.
>>>>>
>>>>> With friendly regards
>>>>> Stefan Scheffler
>>>>>
>>>>
>>>> You'd be much better off finding an n-triple parser that kept going
>>>> and also spat out (working) n-triples for piping to TDB. I can't see
>>>> an option like that in the riot command line.
>>>
>>> There isn't such an option - there could be (if someone wants to
>>> contribute a patch).
>>>
>>> This is a typical ETL situation - you're going to have to clean those
>>> triples (which were presumably not written by an RDF tool).  Do you
>>> want to lose them or fix them?
>>>
>>> Checking before loading is always a good idea, especially for data from
>>> outside and from other tools.  When I receive TTL or RDF/XML, I parse it
>>> to NT, which means it's then checked.  Then I load the data.
>>>
>>>     Andy
>>>
>>
>>   Hi Andy,
>>At the moment I just want to skip the invalid triples (later they should
>>be stored and maybe fixed, if that's possible).
>>The main goal is to have an import process which runs automatically and
>>doesn't stop on every failure it finds.
>>When the checking happens doesn't matter (atm ;)). It can be before or
>>during the import (but I used the second strategy with Sesame).
>>
>>Thanks Stefan
>>
>>-- 
>>Stefan Scheffler
>>Avantgarde Labs GbR
>>Löbauer Straße 19, 01099 Dresden
>>Telefon: + 49 (0) 351 21590834
>>Email: [email protected]
>>
>>
>>
>>
>>
>>
>
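
The line-based recovery discussed in this thread (drop a bad line and carry
on) can be sketched as a standalone pre-filter that runs before the TDB load,
in the spirit of Damian's "parser that kept going" suggestion. This is only an
illustration and not part of ARQ or RIOT: the name filter_ntriples and the
regular expressions are assumptions made for the example, and the patterns
only approximate the N-Triples grammar, so a strict parser may still reject
(or accept) lines that this sketch treats differently.

```python
import re
from typing import Iterable, List, Tuple

# Simplified term patterns (an approximation, not the full N-Triples grammar).
IRI = r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
BNODE = r'_:[A-Za-z0-9][A-Za-z0-9_.-]*'
LITERAL = rf'"(?:[^"\\\n\r]|\\.)*"(?:\^\^{IRI}|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?'

# subject: IRI or blank node; predicate: IRI only; object: any term.
TRIPLE = re.compile(
    rf'^\s*(?:{IRI}|{BNODE})\s+{IRI}\s+(?:{IRI}|{BNODE}|{LITERAL})\s*\.\s*$'
)


def filter_ntriples(lines: Iterable[str]) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Split input lines into well-formed triples and rejected lines.

    Hypothetical helper: returns (good, bad), where bad holds
    (line_number, line) pairs so rejects can be stored and fixed later.
    """
    good: List[str] = []
    bad: List[Tuple[int, str]] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank lines and comments carry no triple
        if TRIPLE.match(line):
            good.append(line)
        else:
            # Recovery step: give up on this line only and move to the next.
            bad.append((lineno, line))
    return good, bad
```

The good lines could then be written to a clean file and handed to the loader,
with the bad ones kept aside for later repair. For files the size of the
DBpedia dumps it would make more sense to stream both outputs to disk rather
than collect them in lists as this sketch does.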
