There may be two things going on ... or one
The WARN
This is broken input data but, from looking at the Jena code, I'm not
seeing how an encoding error on its own can send an unset RDF_StreamRow
(which is what prints as <RDF_StreamRow >). The Jena code is quite simple
and by inspection I can't see a path that sends that. And in Thrift you
can't send unset data anyway - the write path throws a client-side
exception.
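For what it's worth, a minimal sketch of that client-side guard (my
reconstruction, not the actual Fuseki code path; RDF_StreamRow is the
Thrift-generated union in org.apache.jena.riot.thrift.wire):

    import java.io.ByteArrayOutputStream;
    import org.apache.jena.riot.thrift.wire.RDF_StreamRow;
    import org.apache.thrift.protocol.TCompactProtocol;
    import org.apache.thrift.transport.TIOStreamTransport;

    public class UnsetUnionDemo {
        public static void main(String[] args) throws Exception {
            RDF_StreamRow row = new RDF_StreamRow();  // union with no field set
            System.out.println(row);                  // should print "<RDF_StreamRow >"
            // Thrift's TUnion.write() refuses to serialize an unset union and
            // throws a TProtocolException instead of putting junk on the wire.
            row.write(new TCompactProtocol(
                    new TIOStreamTransport(new ByteArrayOutputStream())));
        }
    }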
So I'm wondering about the path from client to server, especially with
the broken pipes earlier, which might impact a reverse proxy or Tomcat
itself. Maybe there is junk on the wire.
The W-transactions
This is TDB1, presumably.
The lack of response suggests they are backing up, and that can happen in
TDB1 if the database engine is blocked from writing to the storage.
That said - it does log if that is happening, unless I/O is blocked at
the OS level.
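To illustrate what "backing up" means (a hedged sketch, not your code):
TDB1 allows a single active writer, so if one write transaction stalls,
every later begin(WRITE) queues behind it:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.tdb.TDBFactory;

    public class WriterQueue {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/path/to/tdb"); // placeholder path
            ds.begin(ReadWrite.WRITE);
            // While this transaction is open, any other thread calling
            // ds.begin(ReadWrite.WRITE) blocks until commit()/abort() -
            // the "PUTs hang" symptom if the first writer never finishes.
            ds.commit();
            ds.end();
        }
    }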
On 02/07/2020 18:37, Chris Tomlinson wrote:
Hi Andy,
On Jul 2, 2020, at 10:22 AM, Andy Seaborne <[email protected]> wrote:
The log does not contain "Unrecognized".
Is the "ThriftConvert WARN visit:" message from a different process?
Sorry. I should have been much clearer:
I have attached the catalina.out with the 199M or so "ThriftConvert WARN"s
removed.
I wasn't clear if "removed" meant stderr, another source, or not.
The event starts at
line 11305: "[2020-06-30 14:12:51] Fuseki INFO [120611] PUT http://buda1. . . ."
with the first WARNing at 14:12:53, after the "[120611] 200 OK (2.106 s)";
and ends with the stacktraces resulting from "systemctl restart fuseki.service" at
line 12940: "30-Jun-2020 16:21:28.445".
The last request logged before the stacktraces is at
line 12939: "[2020-06-30 16:21:26] Fuseki INFO [121275] PUT http://buda1. . . ."
I have a 55.1MB gzip'd file with all of the ~199M "ThriftConvert WARN"s. It is
15GB+ uncompressed. The rate of the WARNing messages is around 20K - 25K per second.
If you need the entire file, where should I upload it?
From what I can see, it (WARN) isn't database related.
There are some broken pipe errors much earlier, but they happen if clients go
away untidily. They are on query responses, not PUTs.
The fact 120612 doesn't even print "Body:"
I think we conclude from this that the RDFConnection became unstable processing
the incoming payload of 7536 triples in the ontologySchema graph. We have
significantly larger graphs that are PUT using the same pattern as I displayed
earlier.
The same source bits/bytes loaded w/o issue before and after this event, so it
seems likely that there was an undetected transmission error that RDFConnection
was not notified of and could not recover from. Maybe that's peculiar to the
Thrift encoding?
Barring a bug in Thrift, no. Either it works or it doesn't, and your
system works most of the time.
You could run with RDFConnectionRemote - what RDFConnectionFuseki does is
switch to the Thrift encodings; RDFConnectionRemote uses the standard ones.
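Something like this - an untested sketch, with baseUrl and the endpoint
names carried over from your code below:

    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionRemote;

    // Same endpoints, but without the Fuseki-specific switch to
    // RDF-Thrift for request/response bodies.
    RDFConnection conn = RDFConnectionRemote.create()
            .destination(baseUrl)
            .queryEndpoint(baseUrl + "/query")
            .gspEndpoint(baseUrl + "/data")
            .updateEndpoint(baseUrl + "/update")
            .build();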
We’ll take this under consideration. The problem occurred once at the end of
April and now at the end of June. Not very frequent given the number of PUTs
that we do on a daily basis. It’s not what I would call effectively repeatable.
Is there a reason that we should distrust Thrift in general? Is it somehow more
susceptible to undetected transmission errors than the standard encodings?
Not that I know of. I'm suggesting it to change the situation to get
more information.
Do we need to consider avoiding Thrift everywhere?
If it works for months, and it is a simple code path, that suggests to me it
is working.
Even the concurrency looks OK, because the client writes to a local buffer
first, so the HTTP content length is available.
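That is, roughly (a sketch of the idea, not the actual code; "model" is
the payload being PUT):

    import java.io.ByteArrayOutputStream;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RDFFormat;

    // Serialize the whole model to a local buffer first ...
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    RDFDataMgr.write(buffer, model, RDFFormat.RDF_THRIFT);
    // ... so the exact Content-Length is known before any bytes go on
    // the wire, and a concurrent writer cannot truncate the request.
    int contentLength = buffer.size();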
Andy
Thanks,
Chris
Andy
On 01/07/2020 17:34, Chris Tomlinson wrote:
Hi Andy,
On Jul 1, 2020, at 7:59 AM, Andy Seaborne <[email protected]> wrote:
Presumably the client is in java using RDFConnectionFactory.connectFuseki?
Yes, the clients are in Java 1.8, also on Debian 4.9. Fuseki is running on one
ec2 instance; some of the clients are on the same instance and others are on
another ec2 instance in the same region. The clients use the following pattern:
fuConnBuilder = RDFConnectionFuseki.create()
    .destination(baseUrl)
    .queryEndpoint(baseUrl + "/query")
    .gspEndpoint(baseUrl + "/data")
    .updateEndpoint(baseUrl + "/update");
fuConn = fuConnBuilder.build();
if (!fuConn.isInTransaction()) {
    fuConn.begin(ReadWrite.WRITE);
}
fuConn.put(graphName, model);
fuConn.commit();
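An equivalent pattern using Txn (a sketch, assuming org.apache.jena.system.Txn),
which closes the transaction even if put() throws:

    import org.apache.jena.system.Txn;

    // commit/abort handled automatically; an exception inside the
    // lambda aborts the transaction instead of leaving it open.
    Txn.executeWrite(fuConn, () -> fuConn.put(graphName, model));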
Do you have the data from 120611?
The data for the PUT of 120611 can be retrieved via
http://purl.bdrc.io/graph/PersonUIShapes.ttl. The source is person.ui.shapes.ttl
<https://github.com/buda-base/editor-templates/blob/master/templates/core/person.ui.shapes.ttl>,
which is loaded via OntDocumentManager w/ setProcessImports(true) from GitHub.
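Roughly, the loading pattern is (a simplified sketch, not our actual code;
sourceUrl is a hypothetical placeholder for the raw GitHub URL of the file):

    import org.apache.jena.ontology.OntDocumentManager;
    import org.apache.jena.ontology.OntModel;
    import org.apache.jena.ontology.OntModelSpec;
    import org.apache.jena.rdf.model.ModelFactory;

    OntDocumentManager odm = new OntDocumentManager();
    odm.setProcessImports(true);                // follow owl:imports
    OntModelSpec spec = new OntModelSpec(OntModelSpec.OWL_MEM);
    spec.setDocumentManager(odm);
    OntModel model = ModelFactory.createOntologyModel(spec);
    model.read(sourceUrl, "TURTLE");            // sourceUrl: placeholder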
From the log, 120611 appears to have completed successfully with 839 triples
(the expected number); however, since then there have been several reloads of
that graph during restarts and so on - we’re in development - so the exact bits
in the 120611 PUT are not specifically available. This particular data has not
changed in the past couple of weeks.
As for 120612, it never completed. The source data has not had any changes
since 2020-06-29, before the "failure" or after it, and there have been numerous
reloads w/o issue before and since. Only during the period 14:12 - 16:21 do we
see the PUTs hang.
Could the request have got truncated?
Referring to 120611, I don’t see any evidence of truncation. The expected
number of triples, 839, is reported in the log.
The fact that later PUTs stall suggests (does not prove) that the earlier
operation broke somehow, but I'd expect a stacktrace in the log.
There may be a stacktrace at the very end, from when I restarted fuseki. There
were quite a few threads aborted w/ stacktraces, but I don’t know how to tell
which might have been associated with 120612, which appears to have never
completed receiving ontologySchema, since no
Fuseki INFO [120612] Body: . . .
appears in the log. There are another 13 occurrences of PUTs which hang during
the period 14:12 - 16:21.
I have attached the catalina.out with the 199M or so "ThriftConvert WARN"s
removed. It shows the stacktraces resulting from the "systemctl restart
fuseki.service" at 30-Jun-2020 16:21:28.445 (line# 12940) through the startup at
30-Jun-2020 16:21:34.866 (line# 18856).
The log also shows that just the PUTs hung; GETs and POSTs (of queries)
continued to be processed, perhaps a little less rapidly than usual.
Overlapping should be fine - as they are on the same database, one will happen
before the other even though they are separate graphs.
As we expected.
It seems that this broken behavior is an unusual occurrence.
Thank you for your help. It’s rather perplexing.
Chris
Andy
On 30/06/2020 22:29, Chris Tomlinson wrote:
Hello,
Running jena/fuseki 3.14.0, compiled and running under openjdk version
"1.8.0_252".
Just some hours ago, fuseki, which has been running on an AWS ec2 instance for
almost a year, logged around 199M occurrences of (times are UTC):
[2020-06-30 14:12:50] Fuseki INFO [120610] 200 OK (8 ms)
[2020-06-30 14:12:51] Fuseki INFO [120611] PUT http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/PersonUIShapes
[2020-06-30 14:12:52] Fuseki INFO [120612] PUT http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/ontologySchema
[2020-06-30 14:12:53] Fuseki INFO [120611] Body: Content-Length=73984, Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : Count=839 Triples=839 Quads=0
[2020-06-30 14:12:53] Fuseki INFO [120611] 200 OK (2.106 s)
[2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
. . .
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
[2020-06-30 16:21:33] ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >
30-Jun-2020 16:21:34.866 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server version: Apache Tomcat/8.0.53
30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server built: Jun 29 2018 14:42:45 UTC
30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Server number: 8.0.53.0
30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Name: Linux
30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log OS Version: 4.9.0-8-amd64
30-Jun-2020 16:21:34.868 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Architecture: amd64
30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log Java Home: /usr/lib/jvm/java-8-openjdk-amd64/jre
30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Version: 1.8.0_252-8u252-b09-1~deb9u1-b09
30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor: Oracle Corporation
30-Jun-2020 16:21:34.869 INFO [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE: /usr/local/fuseki/tomcat
The WARNings continued until I saw the issue and restarted fuseki.
An interesting thing I see in the trace is the overlapping PUTs just as fuseki
wigged out. There were no changes in the code or in the graphs that were being
PUT as part of a restart of an app server.
Fuseki didn't go completely unresponsive except that later PUTs were
indefinitely delayed - which was how I noticed the issue.
We saw this once before, on 29 April, and didn’t delve into it at the time.
Has anyone seen this sort of thing before?
Thanks,
Chris