Yes, I think it is definitely tricky. I would probably suggest having a
separate tokenizer implementation rather than trying to add this
functionality onto the existing one.  That way there is scope for adding
more complex error recovery at a later date, and you don't affect current
performance.

Rob Vesse -- YarcData.com -- A Division of Cray Inc
Software Engineer, Bay Area
m: 925.960.3941  |  o: 925.264.4729 | @: [email protected]  |  Skype:
rvesse
6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588

On 6/13/12 12:52 PM, "Andy Seaborne" <[email protected]> wrote:

>On 13/06/12 17:52, Rob Vesse wrote:
>> Actually it looks like some of this stuff is already in place.  If you
>> take a look at LangNTriples in ARQ you will see it derives from
>> LangNTuples, which has a setSkipOnBadTerms() method, but I can't tell
>> whether this actually affects anything, i.e. whether it is actually
>> honored by LangNTriples, so you may want to experiment and see.
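>>
>> Something along these lines might be a starting point (untested and
>> written from memory - the RiotReader factory method, the Sink interface
>> and the package names may differ in your ARQ version, so treat it as a
>> sketch rather than working code):
>>
>>     import java.io.FileInputStream ;
>>     import java.io.InputStream ;
>>     import org.openjena.atlas.lib.Sink ;
>>     import org.openjena.riot.RiotReader ;
>>     import org.openjena.riot.lang.LangNTriples ;
>>     import com.hp.hpl.jena.graph.Triple ;
>>
>>     public class SkipBadTermsTest {
>>         public static void main(String[] args) throws Exception {
>>             InputStream in = new FileInputStream("data.nt") ;
>>             // A sink that just counts the triples that survive parsing.
>>             Sink<Triple> sink = new Sink<Triple>() {
>>                 long count = 0 ;
>>                 public void send(Triple t) { count++ ; }
>>                 public void flush() {}
>>                 public void close() { System.out.println(count + " triples") ; }
>>             } ;
>>             LangNTriples parser = RiotReader.createParserNTriples(in, sink) ;
>>             parser.setSkipOnBadTerms(true) ;   // the flag in question
>>             parser.parse() ;
>>             sink.close() ;
>>         }
>>     }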
>
>There are two ways I can see of doing it:
>
>1/ The tokenizer itself could be modified and taught to skip at the
>character level (below tokens) to find a real newline, so that aspect is
>easy.  The tokenizer needs upgrading without slowing it down - tuning
>the tokenizer is quite important for overall performance.
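>
>The recovery routine inside the tokenizer might look roughly like this
>(a sketch only - the real tokenizer reads through a PeekReader, but this
>method is illustrative, not existing code):
>
>    // On a tokenizing error: discard characters until we pass a real
>    // newline, then resume tokenizing from the next line.
>    private static void skipToEndOfLine(org.openjena.atlas.io.PeekReader reader) {
>        for (;;) {
>            int ch = reader.readChar() ;
>            if (ch == -1 || ch == '\n')    // EOF or end of line
>                return ;
>        }
>    }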
>
>2/ If the emphasis is on the error recovery, I'd experiment with reading
>in two stages - reading into the large buffer the I/O uses, then reading
>out a line, then parsing the line for a triple.  Error recovery is then
>to throw away the working line if it can't be parsed.
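>
>Sketched out, the two-stage approach might look like this (again
>untested - creating a parser per line is wasteful, but it shows the
>recovery shape; RiotReader/RiotException names as in the sketch above,
>so check them against your ARQ version):
>
>    import java.io.* ;
>    import com.hp.hpl.jena.graph.Triple ;
>    import org.openjena.atlas.lib.Sink ;
>    import org.openjena.riot.RiotException ;
>    import org.openjena.riot.RiotReader ;
>
>    // 'sink' as in the earlier setSkipOnBadTerms sketch.
>    static void parseLineByLine(InputStream in, Sink<Triple> sink)
>            throws IOException {
>        BufferedReader br =
>            new BufferedReader(new InputStreamReader(in, "UTF-8"), 128 * 1024) ;
>        String line ;
>        while ((line = br.readLine()) != null) {
>            try {
>                // Parse the extracted line as one N-Triples statement.
>                InputStream one =
>                    new ByteArrayInputStream(line.getBytes("UTF-8")) ;
>                RiotReader.createParserNTriples(one, sink).parse() ;
>            } catch (RiotException ex) {
>                // Error recovery: throw away the working line.
>                System.err.println("Skipping bad line: " + line) ;
>            }
>        }
>    }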
>
>No real tokenizer changes, but it does an extra copy to extract the line;
>that copy may not make much difference, as the data for the line is in
>the CPU cache and fast to access straight after it is extracted.
>
>(From playing with bytes-to-UTF-8 conversion, I know an extra copy can
>be faster - the Java libraries do better for large blocks than a UTF-8
>decoder I wrote, and they need an extra copy, but presumably the authors
>know exactly what works and what doesn't in Java, even if it's not
>native code.)
>
>For Turtle, it's harder - skipping to DOT-newline is probably OK, based
>on the fact that typical usage does not put multiple blocks of triples
>on one line (yes - it happens, but not much at scale).
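>
>The Turtle variant of the character-level skip might look like this
>(illustrative only - note it is fooled by a dot inside a quoted literal,
>which is the imprecision accepted above):
>
>    // Resync for Turtle: skip until a '.' followed, possibly after
>    // trailing whitespace, by a newline.
>    private static void skipToDotNewline(org.openjena.atlas.io.PeekReader reader) {
>        boolean seenDot = false ;
>        for (;;) {
>            int ch = reader.readChar() ;
>            if (ch == -1) return ;                        // EOF
>            if (ch == '.') { seenDot = true ; continue ; }
>            if (seenDot && ch == '\n') return ;           // DOT newline found
>            if (ch != ' ' && ch != '\t' && ch != '\r')    // other content:
>                seenDot = false ;                         // not end of block
>        }
>    }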
>
>       Andy
>
>
>> On 6/13/12 9:17 AM, "Rob Vesse" <[email protected]> wrote:
>>
>>> Hi Stefan
>>>
>>> I think the main problem here is one of error recovery.  When I see
>>> invalid data at either the tokenizer or parser level, what do I
>>> actually do with it?  I.e. where do I skip forward to in order to
>>> ignore that invalid triple?
>>>
>>> For NTriples, which is officially a line-based format, the fix would
>>> likely be to skip to the end of the line on hitting an error in
>>> tokenizing, and when parsing to skip to the next `.` token, since if
>>> we hit the error in parsing (not tokenization) then we can assume the
>>> tokens are valid syntactically but not semantically, e.g. a blank node
>>> in the predicate position.  If we were talking about other formats,
>>> sensible error recovery may be much harder or impossible.
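>>>
>>> The parse-level half of that could be as simple as this (a sketch
>>> only - Token/TokenType are from ARQ's riot tokens package, but the
>>> method itself is hypothetical):
>>>
>>>    import java.util.Iterator ;
>>>    import org.openjena.riot.tokens.Token ;
>>>    import org.openjena.riot.tokens.TokenType ;
>>>
>>>    // On a parse (not tokenize) error: discard tokens up to and
>>>    // including the next DOT, then resume with the next triple.
>>>    static void skipToNextDot(Iterator<Token> tokens) {
>>>        while (tokens.hasNext()) {
>>>            if (tokens.next().getType() == TokenType.DOT)
>>>                return ;
>>>        }
>>>    }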
>>>
>>> It's probably not that hard to write an NTriples tokenizer and parser
>>> that do error recovery, based on the existing ones; patches are always
>>> welcome.  If I ever have some spare time I might look at this myself.
>>>
>>> Rob
>>>
>>> Rob Vesse -- YarcData.com -- A Division of Cray Inc
>>> Software Engineer, Bay Area
>>> m: 925.960.3941  |  o: 925.264.4729 | @: [email protected]  |  Skype:
>>> rvesse
>>> 6210 Stoneridge Mall Rd  |  Suite 120  | Pleasanton CA, 94588
>>>
>>>
>>>
>>> On 6/13/12 7:13 AM, "Stefan Scheffler" <[email protected]>
>>> wrote:
>>>
>>>>
>>>> On 13.06.2012 15:55, Andy Seaborne wrote:
>>>>> On 13/06/12 14:19, Damian Steer wrote:
>>>>>>
>>>>>> On 13/06/12 14:03, Stefan Scheffler wrote:
>>>>>>> Hello, I need to import large N-Triples files (DBpedia) into a
>>>>>>> TDB store.  The problem is that many of the triples are not valid
>>>>>>> (e.g. missing '<' or invalid characters), leading to an exception
>>>>>>> which quits the import... I just want to skip them and continue,
>>>>>>> so that all valid triples are in the TDB store at the end.
>>>>>>>
>>>>>>> Is there a possibility to do that easily? I tried to rewrite ARQ,
>>>>>>> but this is very complex.
>>>>>>>
>>>>>>> With friendly regards
>>>>>>> Stefan Scheffler
>>>>>>>
>>>>>>
>>>>>> You'd be much better off finding an N-Triples parser that keeps
>>>>>> going and also spits out (working) N-Triples for piping to TDB. I
>>>>>> can't see an option like that in the riot command line.
>>>>>
>>>>> There isn't such an option - there could be (if someone wants to
>>>>> contribute a patch).
>>>>>
>>>>> This is a typical ETL situation - you're going to have to clean those
>>>>> triples (which were presumably not written by an RDF tool).  Do you
>>>>> want to lose them or fix them?
>>>>>
>>>>> Checking before loading is always a good idea, especially for data
>>>>> from outside and from other tools.  When I receive TTL or RDF/XML, I
>>>>> parse it to NT, which means it's then checked.  Then load the data.
>>>>>
>>>>>      Andy
>>>>>
>>>>
>>>>    Hi Andy,
>>>> At the moment I just want to skip the invalid triples (later they
>>>> should be stored and maybe fixed, if it's possible).
>>>> The main goal is to have an import process which runs automatically
>>>> and doesn't stop on every failure it finds.
>>>> The moment of checking doesn't matter (atm ;)).  It can be before or
>>>> during the import (but I used the second strategy with Sesame).
>>>>
>>>> Thanks Stefan
>>>>
>>>> --
>>>> Stefan Scheffler
>>>> Avantgarde Labs GbR
>>>> Löbauer Straße 19, 01099 Dresden
>>>> Telefon: + 49 (0) 351 21590834
>>>> Email: [email protected]
>>>>
>>>
>>
>
