> On Jul 2, 2020, at 15:33, Andy Seaborne <a...@apache.org> wrote:
> 
> There may be two things going on ... or one
> 
> The WARN
> 
> This is broken input data, but from looking at the Jena code, I'm not seeing 
> how an encoding error on its own can send an unset RDF_StreamRow (which is 
> what prints as <RDF_StreamRow >). The Jena code is quite simple and by 
> inspection I can't see a path that sends that. And in Thrift you can't send 
> unset data anyway - the write path throws a client-side exception.
> 
> So I'm wondering about the path from client to server, especially with the 
> broken pipes earlier, which might impact a reverse proxy or Tomcat itself. 
> Maybe there is junk on the wire.
> 
> 
> The W-transactions
> This is TDB1 presumably.
> 
> The lack of response suggests they are backing up, and that can happen in TDB1 
> if the database engine is blocked from writing to storage. That said, it 
> does log if that is happening, unless I/O is blocked at the OS level.
> 
>>> On 02/07/2020 18:37, Chris Tomlinson wrote:
>> Hi Andy,
>>>> On Jul 2, 2020, at 10:22 AM, Andy Seaborne <a...@apache.org> wrote:
>>> The log does not contain "Unrecognized".
>>> Is the "ThriftConvert WARN visit:" message from a different process?
>> Sorry. I should have been much clearer:
>>>> I have attached the catalina.out with the 199M or so "ThriftConvert WARN”s 
>>>> removed.
> 
> I wasn't clear whether "removed" meant stderr, another source, or not.

grep -v "ThriftConvert WARN visit: Unrecognized: <RDF_StreamRow >" catalina.out


> 
>> The event starts at
>>    line 11305 "[2020-06-30 14:12:51] Fuseki     INFO  [120611] PUT 
>> http://buda1. . . .”
>> with the first WARNing at 14:12:53 after the "[120611] 200 OK (2.106 s)”;
>> and ends with the stacktraces resulting from “systemctl restart 
>> fuseki.service” at :
>>    line 12940 "30-Jun-2020 16:21:28.445"
>> and the stacktraces end at startup:
>>    line 12939 "[2020-06-30 16:21:26] Fuseki     INFO  [121275] PUT 
>> http://buda1. . . . "
>> I have a 55.1MB gzip’d file with all of the ~199M "ThriftConvert WARN”s. It 
>> is 15GB+ uncompressed. The rate of the WARNing messages is around 20K - 25K 
>> per second.
>> If you need the entire file, where should I upload it?
> 
> From what I can see, it (WARN) isn't database related.

No, it seems to me to be related to getting the payload off the wire.

> 
>>> There are some broken pipe errors much earlier but they happen if clients 
>>> go away untidily.  They are on query responses not PUTs.
>>> The fact 120612 doesn't even print "Body:”
>> I think we conclude from this that the RDFConnection became unstable 
>> processing the incoming payload of 7536 triples in the ontologySchema graph. 
>> We have significantly larger graphs that are PUT using the same pattern as I 
>> displayed earlier.
>> The same source bits/bytes loaded w/o issue before and after this event, so 
>> it seems likely that there was an undetected transmission error that 
>> RDFConnection was not notified of and was not able to recover from. Maybe 
>> that’s peculiar to the Thrift encoding?
> 
> Caveat a bug in Thrift, no.  Either it works or it doesn't, and your system 
> works most of the time.

Yes, this is a very intermittent problem.


> 
>>> You could run with RDFConnectionRemote - what RDFConnectionFuseki does is 
>>> switch to thrift encodings, RDFConnectionRemote uses the standards ones.
>> We’ll take this under consideration. The problem occurred once at the end of 
>> April and now at the end of June. Not very frequent, given the number of PUTs 
>> that we do on a daily basis. It’s not what I would call effectively 
>> repeatable.
>> Is there a reason that we should distrust Thrift in general? Is it somehow 
>> more susceptible to undetected transmission errors that the standard 
>> encodings?
> 
> Not that I know of. I'm suggesting it to change the situation to get more 
> information.

Ok. I'm not sure we'll make a change just now as we're in the middle of a bit 
of a crunch.
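
When we do get to it, the change looks small on our side. Here is a minimal 
sketch (untested) of what I understand the switch to be, assuming our existing 
baseUrl, endpoint names, and put/commit pattern stay the same and only the 
builder class changes, so the standard content types are used instead of RDF 
Thrift:

    // imports assumed: org.apache.jena.rdfconnection.RDFConnection,
    //                  org.apache.jena.rdfconnection.RDFConnectionRemote,
    //                  org.apache.jena.query.ReadWrite
    RDFConnection fuConn = RDFConnectionRemote.create()  // was RDFConnectionFuseki.create()
            .destination(baseUrl)
            .queryEndpoint(baseUrl + "/query")
            .gspEndpoint(baseUrl + "/data")
            .updateEndpoint(baseUrl + "/update")
            .build();

    if (!fuConn.isInTransaction()) {
        fuConn.begin(ReadWrite.WRITE);
    }
    fuConn.put(graphName, model);
    fuConn.commit();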

> 
>> Do we need to consider avoiding Thrift everywhere?
> 
> If it works for months, and it is a simple code path, that suggests to me it 
> is working.
> 
> Even the concurrency looks OK because it locally writes a buffer so the HTTP 
> length is available.

So it seems to me that there may be an opportunity for some sort of 
robustification in RDFConnection. There is evidently a loop somewhere that 
doesn't terminate, retrying the parsing repeatedly or something like that. The 
payload is finite, so there would appear to be a test that repeatedly fails but 
doesn't make progress in consuming the payload.
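
In the meantime we will probably at least wrap the client-side PUT so that an 
exception can't leave a write transaction dangling. A minimal sketch of what I 
have in mind (our own defensive wrapper around the put pattern I showed 
earlier, not anything required by Jena):

    // Defensive wrapper (sketch): if put() or commit() throws, abort the
    // write transaction and release it rather than leaving it open.
    fuConn.begin(ReadWrite.WRITE);
    try {
        fuConn.put(graphName, model);
        fuConn.commit();
    } catch (RuntimeException e) {
        fuConn.abort();   // roll back this attempt
        throw e;
    } finally {
        fuConn.end();     // always release the transaction
    }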

Thanks,
Chris


> 
>   Andy
> 
>> Thanks,
>> Chris
>>>    Andy
>>> On 01/07/2020 17:34, Chris Tomlinson wrote:
>>>> Hi Andy,
>>>>> On Jul 1, 2020, at 7:59 AM, Andy Seaborne <a...@apache.org 
>>>>> <mailto:a...@apache.org>> wrote:
>>>>> Presumably the client is in java using RDFConnectionFactory.connectFuseki?
>>>> Yes, the clients are in Java 1.8, also on Debian 4.9. Fuseki is running on 
>>>> one ec2 instance; some of the clients are on the same instance and 
>>>> others are on another ec2 instance in the same region. The clients use the 
>>>> following pattern:
>>>>>        fuConnBuilder = RDFConnectionFuseki.create()
>>>>>                .destination(baseUrl)
>>>>>                .queryEndpoint(baseUrl+"/query")
>>>>>                .gspEndpoint(baseUrl+"/data")
>>>>>                .updateEndpoint(baseUrl+"/update");
>>>>>        fuConn = fuConnBuilder.build();
>>>>>        if (!fuConn.isInTransaction()) {
>>>>>            fuConn.begin(ReadWrite.WRITE);
>>>>>        }
>>>>>        fuConn.put(graphName, model);
>>>>>        fuConn.commit();
>>>>> Do you have the data from 120611?
>>>> The data for the PUT of 120611 can be retrieved via 
>>>> http://purl.bdrc.io/graph/PersonUIShapes.ttl. The source is 
>>>> person.ui.shapes.ttl 
>>>> <https://github.com/buda-base/editor-templates/blob/master/templates/core/person.ui.shapes.ttl>
>>>>  which is loaded via OntDocumentManager w/ setProcessImports(true) from GH.
>>>> From the log, 120611 appears to have completed successfully with 839 
>>>> triples (the expected number); however, since then there have been several 
>>>> reloads of that graph during restarts and so on - we’re in development - 
>>>> so the exact bits in the 120611 PUT are not specifically available. This 
>>>> particular data has not changed in the past couple of weeks.
>>>> As for 120612, it never completed. The source data has not had any 
>>>> changes since 2020-06-29, before the “failure” or after, and there have 
>>>> been numerous reloads w/o issue before and since. Only during the period 
>>>> 14:12 - 16:21 do we see the PUTs hang.
>>>>> Could the request have got truncated?
>>>> Referring to 120611, I don’t see any evidence of truncation. The expected 
>>>> number of triples, 839, is reported in the log.
>>>>> The fact later PUTs stall suggests (does not prove) that the earlier 
>>>>> operation broke somehow, but I'd expect a stacktrace in the log.
>>>> There may be a stacktrace at the very end when I restarted fuseki. There 
>>>> were quite a few threads aborted w/ stacktraces, but I don’t know how to 
>>>> interpret which might have been associated with 120612, which appears to 
>>>> have never completed receiving ontologySchema, since no
>>>>    Fuseki INFO [120612] Body: . . .
>>>> appears in the log. There are another 13 occurrences of PUTs which hang 
>>>> during the period 14:12 - 16:21.
>>>> I have attached the catalina.out with the 199M or so "ThriftConvert WARN”s 
>>>> removed. It shows the stacktraces resulting from the “systemctl restart 
>>>> fuseki.service” at 30-Jun-2020 16:21:28.445 line# 12940 to the startup at 
>>>> 30-Jun-2020 16:21:34.866 line# 18856.
>>>> The log also shows that just the PUTs hung; GETs and POSTs (of queries) 
>>>> continued to be processed, perhaps a little less rapidly than usual.
>>>>> Overlapping should be fine - as they are on the same database, one will 
>>>>> happen before the other even though they are separate graphs.
>>>> As we expected.
>>>> It seems that this broken behavior is an unusual occurrence.
>>>> Thank you for your help. It’s rather perplexing.
>>>> Chris
>>>>>   Andy
>>>>> On 30/06/2020 22:29, Chris Tomlinson wrote:
>>>>>> Hello,
>>>>>> Running jena/fuseki 3.14.0, compiled and running under openjdk version 
>>>>>> “1.8.0_252”.
>>>>>> Just some hours ago, fuseki, which has been running on an AWS ec2 
>>>>>> instance for almost a year, logged around 199M occurrences of (times are UTC):
>>>>>>> [2020-06-30 14:12:50] Fuseki INFO [120610] 200 OK (8 ms)
>>>>>>> [2020-06-30 14:12:51] Fuseki INFO [120611] PUT 
>>>>>>> http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/PersonUIShapes
>>>>>>> [2020-06-30 14:12:52] Fuseki INFO [120612] PUT 
>>>>>>> http://buda1.bdrc.io:13180/fuseki/corerw/data?graph=http://purl.bdrc.io/graph/ontologySchema
>>>>>>> [2020-06-30 14:12:53] Fuseki INFO [120611] Body: Content-Length=73984, 
>>>>>>> Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : 
>>>>>>> Count=839 Triples=839 Quads=0
>>>>>>> [2020-06-30 14:12:53] Fuseki INFO [120611] 200 OK (2.106 s)
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 14:12:57] ThriftConvert WARN visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>> . . .
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> [2020-06-30 16:21:33] ThriftConvert WARN  visit: Unrecognized: 
>>>>>>> <RDF_StreamRow >
>>>>>>> 30-Jun-2020 16:21:34.866 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log Server version:   
>>>>>>>      Apache Tomcat/8.0.53
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log Server built:     
>>>>>>>      Jun 29 2018 14:42:45 UTC
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log Server number:    
>>>>>>>      8.0.53.0
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log OS Name:          
>>>>>>>      Linux
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log OS Version:       
>>>>>>>      4.9.0-8-amd64
>>>>>>> 30-Jun-2020 16:21:34.868 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log Architecture:     
>>>>>>>      amd64
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log Java Home:        
>>>>>>>      /usr/lib/jvm/java-8-openjdk-amd64/jre
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log JVM Version:      
>>>>>>>      1.8.0_252-8u252-b09-1~deb9u1-b09
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log JVM Vendor:       
>>>>>>>      Oracle Corporation
>>>>>>> 30-Jun-2020 16:21:34.869 INFO [main] 
>>>>>>> org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE:    
>>>>>>>      /usr/local/fuseki/tomcat
>>>>>> Until I saw the issue and restarted fuseki.
>>>>>> An interesting thing I see in the trace is the overlapping PUTs just as 
>>>>>> fuseki wigged out. There were no changes in the code or in the graphs that 
>>>>>> were being PUT as part of a restart of an app server.
>>>>>> Fuseki didn't go completely unresponsive except that later PUTs were 
>>>>>> indefinitely delayed - which was how I noticed the issue.
>>>>>> We saw this once before, on 29 April, and didn’t delve into it at the time.
>>>>>> Has anyone seen this sort of thing before?
>>>>>> Thanks,
>>>>>> Chris
