Thrift problem / corruption on large TDB2 Fuseki dataset

Osma Suominen Wed, 29 Mar 2023 00:06:55 -0700

Hi all,

we've recently rebuilt our two main virtual servers running Fuseki which are the backend databases of the Finto.fi vocabulary service. After running the new servers for a few weeks we've already seen two cases, one on each server, where some of the SPARQL queries start failing.

In our setup, Fuseki is managing a relatively large TDB2 database along with a jena-text index. We keep the data in around 50 named graphs (one per vocabulary) and each graph is typically updated using s-put, replacing the whole graph in-place. When all data is initially loaded, the database directory takes around 25GB on one server and 44GB on the other. TDB2 tends to keep growing over time though, so around once a month we delete the whole database and rebuild it from RDF source files that we keep under version control; the master data doesn't reside within Fuseki.
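
To illustrate the update pattern described above: replacing a whole named graph amounts to an HTTP PUT against the SPARQL Graph Store Protocol endpoint. The following is only a minimal sketch of that kind of request, not the s-put script itself. The endpoint URL, graph URI and function names are hypothetical placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_graph_put_request(gsp_endpoint, graph_uri, rdf_bytes,
                            content_type="text/turtle"):
    """Build an HTTP PUT that replaces one named graph (Graph Store Protocol)."""
    # The target graph is passed as the ?graph=<uri> query parameter.
    url = gsp_endpoint + "?" + urlencode({"graph": graph_uri})
    return Request(url, data=rdf_bytes, method="PUT",
                   headers={"Content-Type": content_type})


def replace_graph(gsp_endpoint, graph_uri, rdf_bytes):
    """Send the PUT and return the HTTP status code (performs network I/O)."""
    request = build_graph_put_request(gsp_endpoint, graph_uri, rdf_bytes)
    with urlopen(request) as response:
        return response.status
```

A call would look like `replace_graph("http://localhost:3030/ds/data", "http://example.org/vocab/", data)`, where both URLs are assumptions that depend on how the dataset and its services are configured.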

When queries start failing, the Fuseki logs show long tracebacks, but the beef seems to be these two exceptions:

org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0

I've put the whole tracebacks into a gist [1]: there is one traceback for a failed SELECT query and another for a failed update query.

In both cases, the database size had grown to over 100GB when this happened. A restart of Fuseki didn't help, but rebuilding the whole database made the problem go away - for now. But this is making me worried that it could happen again any time. I'm sorry that I don't have an easy way of reproducing the problem at this time, as it only seems to happen after Fuseki has been running for a few weeks and done many different update operations on the TDB2 dataset.

I searched online for similar issues and exceptions. I could only find one gist [2] showing a very similar traceback, which apparently happened when running a compact operation on Fuseki 4.3.1. There was also a similar issue [3] reported in Apache Impala, and there the problem was fixed by adding more careful checks into the code writing Thrift node metadata. But that code is written in C++, so it's a bit hard to compare that to the Jena codebase.

The Impala issue says: "Since IMPALA-1048 we write TRuntimeProfileNode.node_metadata unconditionally, even when both its fields are unset. This trips up the Thrift library Java reader code, which expects to find exactly one type of a union to be set." Is it possible that Jena is similarly careless when writing Thrift metadata?

This happened with Fuseki version 4.6.1, since we did the install just before the 4.7.0 release. I've just upgraded one of the machines to 4.7.0 to see if it makes a difference. I can see that libthrift was updated from 0.16.0 to 0.17.0 in PR #1570, which happened in between the two Jena releases. It's possible that the problem has already been fixed there. In that case, I'm really sorry for the noise.

Is there anything I could do to help debug the problem? For now I will just keep monitoring the Fuseki instances to see if this happens again, especially with the new version.
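
As a rough idea of what such monitoring could look like, here is a minimal sketch that scans log lines for the two exception messages quoted earlier. The function and constant names are made up for illustration, and the log file location is an assumption that depends on how Fuseki logging is configured.

```python
# The two exception messages seen in the failing Fuseki logs.
SIGNATURES = (
    "org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read",
    "org.apache.thrift.protocol.TProtocolException: Unrecognized type 0",
)


def find_thrift_errors(lines):
    """Return (line_number, text) pairs for lines containing a known signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(signature in line for signature in SIGNATURES):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It could be run periodically over the log, for example `find_thrift_errors(open("fuseki.log", encoding="utf-8"))` with a hypothetical file name, raising an alert whenever the returned list is non-empty.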


Information about the setup:

OS: Rocky Linux 9.1 (RHEL based)
Kernel/arch: 5.14.0 x86_64
Java: openjdk version "11.0.18" 2023-01-17 LTS

Cheers,
Osma


[1] https://gist.github.com/osma/d61281160e84ea74e9d7dbc155ffaf69

[2] https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af

[3] https://issues.apache.org/jira/browse/IMPALA-8252

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
