On 26/08/2022 19:50, Dan Brickley wrote:
On Fri, 26 Aug 2022 at 16:27, Andy Seaborne <a...@apache.org> wrote:



On 26/08/2022 15:03, Lorenz Buehmann wrote:
I'm asking because I thought the initial loading is way faster than
iterating over multiple (graph, file) pairs and running the TDB2 loader
for each pair?

Yes. It is faster when loading from empty in a single run of a loader.

The loaders do some straight-to-index work which makes proper
transactions impossible, and so if a load has a parse error, a bypass of
transactions would, at best, break the database with half a load, or, at
worst, break the database.


Is it possible to load into new and dedicated named graphs so that such
partial loads could be easily cleaned up / reverted? Or is the corruption
deeper in the underlying data structures (index etc.)?

What sort of errors are you thinking of?

Loaders are one step of the pipeline from getting data from some 3rd party and into the database. Their role is to get data in as fast as possible within the hardware constraints.

A syntax error will be detected by the parser, and when the parser aborts the whole load aborts. Bulk loading is multiphase - load triples to get a node table, the primary index (SPO, GSPO), then build the other indexes. It is faster this way - and can have parallelism. Several loaders have various degrees of parallelism.

If it aborts, there is, at best, a partial SPO table, no other indexes. The rest of the system assumes a valid database.
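To illustrate the point above, here is a toy sketch in Python of a two-phase load with no transaction around it. It is not Jena code: the names `parse`, `phased_load` and `empty_db`, the whitespace-separated "triple" format, and the in-memory lists standing in for indexes are all assumptions made for the example.

```python
class ParseError(Exception):
    """Raised when an input line is not a well-formed toy triple."""


def parse(lines):
    """Yield (s, p, o) tuples; raise ParseError on the first bad line."""
    for number, line in enumerate(lines, start=1):
        terms = line.split()
        if len(terms) != 3:
            raise ParseError(f"line {number}: expected 3 terms, got {len(terms)}")
        yield tuple(terms)


def empty_db():
    """A toy database: one primary index and two secondary indexes."""
    return {"SPO": [], "POS": [], "OSP": []}


def phased_load(db, lines):
    """Toy two-phase bulk load writing straight into the indexes."""
    # Phase 1: stream parsed triples directly into the primary index.
    for triple in parse(lines):
        db["SPO"].append(triple)
    # Phase 2: only reached if parsing finished; build the other indexes.
    db["POS"] = sorted((p, o, s) for s, p, o in db["SPO"])
    db["OSP"] = sorted((o, s, p) for s, p, o in db["SPO"])


good = ["a knows b", "b knows c"]
bad = ["a knows b", "b knows", "c knows d"]  # second line is malformed

ok_db = empty_db()
phased_load(ok_db, good)

broken_db = empty_db()
try:
    phased_load(broken_db, bad)
except ParseError as err:
    error_message = str(err)
```

After the failed run, `broken_db` holds one triple in SPO and nothing in POS or OSP, which is the "partial SPO table, no other indexes" state described above.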

Syntax errors should be caught by checking first with 'riot' if you can't trust the source.
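A check-then-load step could be scripted roughly as below. This is a sketch under assumptions, not a recommended recipe: it assumes `riot` and `tdb2.tdbloader` are on the PATH, that `riot --validate` exits with a non-zero status when it finds errors, and that the loader accepts `--loc` and `--graph=` as written here (check `--help` for your Jena version). The function names are made up for the example.

```python
import subprocess


def riot_check_command(path):
    """Command line that parses and checks a file without loading it."""
    return ["riot", "--validate", path]


def loader_command(db_dir, graph_iri, path):
    """Command line that loads one file into one named graph."""
    return ["tdb2.tdbloader", "--loc", db_dir, f"--graph={graph_iri}", path]


def check_then_load(db_dir, graph_iri, path, run=subprocess.run):
    """Run the syntax check first; start the loader only if it passes.

    Returns False without touching the database if the check fails.
    The `run` parameter exists so the function can be exercised
    without Jena installed.
    """
    check = run(riot_check_command(path))
    if check.returncode != 0:
        return False
    run(loader_command(db_dir, graph_iri, path), check=True)
    return True
```

Usage would be something like `check_then_load("DB2", "http://example/g1", "data.ttl")` for each (graph, file) pair.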

The single-threaded loaders are transactional and will abort the load transaction. No data is loaded, and the database is in the same state as when the load started. They also work on already-existing databases.
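For contrast, a toy sketch in Python of the transactional behaviour: work on a staged copy and commit only if the whole input parses. Again this is not how TDB2 implements transactions; `parse`, `transactional_load` and the dict-of-lists "database" are assumptions made for the example.

```python
import copy


class ParseError(Exception):
    """Raised when an input line is not a well-formed toy triple."""


def parse(lines):
    """Yield (s, p, o) tuples; raise ParseError on the first bad line."""
    for number, line in enumerate(lines, start=1):
        terms = line.split()
        if len(terms) != 3:
            raise ParseError(f"line {number}: expected 3 terms, got {len(terms)}")
        yield tuple(terms)


def transactional_load(db, lines):
    """Load into a staged copy; replace the live data only on success."""
    staged = copy.deepcopy(db)
    for s, p, o in parse(lines):  # a ParseError here leaves db untouched
        staged["SPO"].append((s, p, o))
        staged["POS"].append((p, o, s))
        staged["OSP"].append((o, s, p))
    # Commit: every line parsed, so swap the staged contents in.
    db.clear()
    db.update(staged)


# An already-existing database with one triple in every index.
db = {"SPO": [("x", "knows", "y")],
      "POS": [("knows", "y", "x")],
      "OSP": [("y", "x", "knows")]}
before = copy.deepcopy(db)

try:
    transactional_load(db, ["a knows b", "b knows"])  # second line is bad
except ParseError:
    aborted = True

unchanged_after_abort = (db == before)

transactional_load(db, ["a knows b", "b knows c"])
```

The failed load leaves `db` equal to `before`; the second, clean load adds both triples to all three indexes.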

Schema checks (SHACL, ShEx) work on valid RDF, so for schema errors all loaders will work. The loaders "only" need syntactically valid RDF.

Schema fixup is later.

        Andy


Dan


         Andy

