Hi all,

We've recently rebuilt our two main virtual servers running Fuseki, which
are the backend databases of the Finto.fi vocabulary service. After
running the new servers for a few weeks we've already seen two cases,
one on each server, where some of the SPARQL queries start failing.

In our setup, Fuseki is managing a relatively large TDB2 database along
with a jena-text index. We keep the data in around 50 named graphs (one
per vocabulary) and each graph is typically updated using s-put,
replacing the whole graph in place. When all data is initially loaded,
the database directory takes around 25 GB on one server and 44 GB on the
other. TDB2 tends to keep growing over time though, so around once a
month we delete the whole database and rebuild it from RDF source files
that we keep under version control; the master data doesn't reside
within Fuseki.
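For anyone unfamiliar with s-put: it replaces a named graph using the SPARQL 1.1 Graph Store HTTP Protocol, i.e. an HTTP PUT against the dataset's graph store endpoint with the target graph given as a query parameter. Below is a rough Python sketch of an equivalent request. The endpoint URL, graph URI and timeout are made-up placeholders, not our actual configuration, and the helper names are only for illustration.

```python
import urllib.parse
import urllib.request


def build_graph_put(endpoint, graph_uri, turtle_bytes):
    """Build an HTTP PUT request that replaces one named graph.

    Per the SPARQL 1.1 Graph Store HTTP Protocol, the target graph is
    named in the ?graph= query parameter and the request body becomes
    the new content of that graph.
    """
    query = urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        url=f"{endpoint}?{query}",
        data=turtle_bytes,
        method="PUT",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


def replace_graph(endpoint, graph_uri, turtle_path, timeout=600):
    """Send the PUT request and return the HTTP status code.

    The timeout value is an arbitrary assumption.
    """
    with open(turtle_path, "rb") as f:
        request = build_graph_put(endpoint, graph_uri, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call would look something like replace_graph("http://localhost:3030/ds/data", "http://example.org/graph/vocabulary", "vocabulary.ttl"), where all three values are hypothetical.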
When queries start failing, the Fuseki logs show long tracebacks, but
the core of the problem seems to be these two exceptions:

    org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read
    Caused by: org.apache.thrift.protocol.TProtocolException: Unrecognized type 0

I've put the full tracebacks into a gist [1]: there is one traceback
for a failed SELECT query and another for a failed update query.
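To spot a recurrence early, the logs can be scanned for these two signatures. This is only a minimal sketch: the function name is made up, and the log file location depends on how Fuseki is run, so the path in the usage note is a placeholder.

```python
# The two exception signatures quoted in this message.
SIGNATURES = (
    "org.apache.jena.tdb2.TDBException: NodeTableTRDF/Read",
    "org.apache.thrift.protocol.TProtocolException: Unrecognized type 0",
)


def count_signatures(lines):
    """Count how many log lines contain each known exception signature."""
    counts = {signature: 0 for signature in SIGNATURES}
    for line in lines:
        for signature in SIGNATURES:
            if signature in line:
                counts[signature] += 1
    return counts
```

Since an open file is an iterable of lines, it can be passed in directly, e.g. count_signatures(open("/path/to/fuseki.log", errors="replace")), with the path being hypothetical.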
In both cases, the database had grown to over 100 GB by the time this
happened. Restarting Fuseki didn't help, but rebuilding the whole
database made the problem go away, at least for now. Still, I'm
worried that it could happen again at any time. I'm sorry that I don't have
an easy way of reproducing the problem at this time, as it only seems to
happen after Fuseki has been running for a few weeks and done many
different update operations on the TDB2 dataset.

I searched online for similar issues and exceptions. I could only find
one gist [2] showing a very similar traceback, which apparently happened
when running a compact operation on Fuseki 4.3.1. There was also a
similar issue [3] reported in Apache Impala, and there the problem was
fixed by adding more careful checks into the code writing Thrift node
metadata. But that code is written in C++, so it's a bit hard to compare
that to the Jena codebase.

The Impala issue says: "Since IMPALA-1048 we write
TRuntimeProfileNode.node_metadata unconditionally, even when both its
fields are unset. This trips up the Thrift library Java reader code,
which expects to find exactly one type of a union to be set." Is it
possible that Jena is similarly careless when writing Thrift metadata?

This happened with Fuseki version 4.6.1, since we did the install just
before the 4.7.0 release. I've just upgraded one of the machines to
4.7.0 to see if it makes a difference. I can see that libthrift was
updated from 0.16.0 to 0.17.0 in PR #1570, which happened in between the
two Jena releases. It's possible that the problem has already been fixed
there. In that case, I'm really sorry for the noise.

Is there anything I could do to help debug the problem? For now I will
just keep monitoring the Fuseki instances to see if this happens again,
especially with the new version.
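As part of that monitoring, the size of the database directory can be tracked with something like the following. This is a sketch under assumptions: the function names and the directory path in the usage note are placeholders, and the 100 GB limit is simply the size at which both failures were observed, not a known threshold. It sums apparent file sizes, which may differ from actual disk usage if any files are sparse.

```python
import os

# Assumed warning limit: both failures were seen above roughly 100 GB.
DEFAULT_LIMIT_BYTES = 100 * 1000**3


def directory_size_bytes(path):
    """Return the total apparent size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue  # do not follow symbolic links
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file may have been removed while we were walking.
                continue
    return total


def exceeds_limit(path, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return True if the directory has grown beyond limit_bytes."""
    return directory_size_bytes(path) > limit_bytes
```

Run from cron, a check such as exceeds_limit("/path/to/fuseki/databases/ds") could trigger a notification before the database reaches the size where the errors appeared; the path is hypothetical.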
Information about the setup:

OS: Rocky Linux 9.1 (RHEL based)
Kernel/arch: 5.14.0 x86_64
Java: openjdk version "11.0.18" 2023-01-17 LTS

Cheers,
Osma

[1] https://gist.github.com/osma/d61281160e84ea74e9d7dbc155ffaf69
[2] https://gist.github.com/jeffreycwitt/e7c270aae46f403845c87aa57e4b82af
[3] https://issues.apache.org/jira/browse/IMPALA-8252

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi