Hi Andy, I have been able to reproduce the problem in a standard non-docker Jetty 9.4 deployment of fuseki-3.6.0 Thus, I think it is safe to rule out docker and Tomcat as causes.
To further reduce causality, it would help to write a test that could reproduce trellis-triplestore ResourceService <https://github.com/trellis-ldp/trellis/blob/master/trellis-triplestore/src/main/java/org/trellisldp/triplestore/TriplestoreResourceService.java> operations for a PUT request without using RDFConnection. This would involve potentially parallel and dynamic GET and POST operations to Fuseki. There is not much advantage of parallel operations for Fuseki+TDB - some > protocol parallelism - because by the time it gets to the TDB storage, > there is one active writer at a time. That seems true. The advantages of parallel operations are negligible at storage. I tend to think that there are simply practical benefits to non-blocking async IO for clients who need to batch and control large numbers of over-the-wire update requests. This point of view is certainly debatable. I CC Aaron Coburn (the TrellisLDP developer) in this email, and perhaps he can help with the test case. -Christopher On 31 March 2018 at 19:53, Andy Seaborne <[email protected]> wrote: > Christopher, > > I'm not seeing any 500 responses in the 3.6.0 logs - which is odd in > itself - and the update operations are SPARQL updates using POST. No PUTs. > Did you mean these are in the Trellis logs? > > If it is concurrency, it should be possible to simulate by shell script > hitting the server with parallel requests as long as the requests cause > change in the database. > > There is not much advantage of parallel operations for Fuseki+TDB - some > protocol parallelism - because by the time it gets to the TDB storage, > there is one active writer at a time. > > An issue of alloc-write in docker setups has been reported before. That > was with shared folders and the setup of the docker filesystem used. > > Underlying problems with the host filesystem are not going to get reported > nicely (e.g lack of space). Also - the TDB database locking will not work > perfectly on docker and any shared access, even if a process on the host is > only reading the database (that includes backup software) may interfere. > > The JIRA resolved tickets for 3.5.0: > https://s.apache.org/jena-3.5.0-jira > > JENA-1379 is the most obvious TDB related one but if that is the cause > then it could be that 3.4.0 is silently going wrong. > > Andy > > On 31/03/18 12:19, Christopher Johnson wrote: > >> Hi, >> >> Thank you for the reply. It is a relatively complex test method to >> reproduce, unfortunately, because it requires JDK 10 with the >> jdk.incubator.httpclient. The client also connects to the app with >> HTTP/2. I suspect that any asynchronous client, however, could produce a >> similar result. >> >> Here are some details: >> 1. TrellisLDP has configuration options to write to an in memory or TDB >> filesystem dataset. Both of these options work with asynchronous writes, >> and TrellisLDP depends on jena-osgi and jena-tdb2 version 3.6.0. >> 2. Fuseki is deployed in a Tomcat 9 container and trellis-app connects to >> it over HTTPS using RDFConnection.connect("https:/ >> /fuseki:8443/fuseki/trellis"). (I do not suspect that Tomcat or the >> protocol are consequential to the problem). >> 3. The tests always start from a clean dataset. >> 4. The client test does repeated PUTs of a 5k size n-triples dataset (400 >> iterations). >> 5. The client test is run in JUnit5 in a JDK 10 JVM on localhost. >> 6. The trellis-app and Fuseki deployment is dockerized. The Fuseki >> dataset >> directory is bound to the localhost filesystem. The container >> environments >> are initialized clean for each run, which facilitates reproducible test >> output. >> >> I have committed server and client access logs from a 3.4.0 deployment and >> a 3.6.0 deployment here: >> https://github.com/pan-dora/ldp-client/tree/async/fuseki-issue >> >> The 3.4.0 test starts at 12:19:05.807 (Central European Summer Time). The >> access log records 400 PUT requests all with a 201 response in 25 seconds) >> The 3.6.0 test starts at 12:27:18.161 (Central European Summer Time). The >> access log records 266 PUT requests with a 201 response and 134 with a 500 >> response in 34 seconds) >> >> The first exception occurs 23 seconds into the test (line 12369 in >> fuseki-3.6.0-async.txt): >> >> [2018-03-31 10:27:41] BindingTDB ERROR get1(?object) >>> org.apache.jena.tdb.base.file.FileException: In the middle of an >>> alloc-write >>> >> >> >> Thank you again for your assistance, and I am glad to do further >> diagnostics. I have considered applying incremental commit patches to a >> 3.4.0 build to isolate the changeset(s) as the next course of action. >> >> Christopher Johnson >> Scientific Associate >> Universitätsbibliothek Leipzig >> >> >> On 31 March 2018 at 09:26, Andy Seaborne <[email protected]> wrote: >> >> Hi there, >>> >>> The log files would be useful to know what the server was doing at the >>> time and to see the stacktraces. The ideal is a complete, minimal >>> example >>> which we can use to investigate. Without a runnable setup, its a bit >>> harder to investigate. >>> >>> Some questions to help me understand what is happening: >>> >>> What does the the code in pan-dora/ldp-client is do? If it is concurrent >>> operations, what else is the client doing at the time? >>> >>> Does it work if you go back to 3.4.0 after using 3.5.0 where it went >>> wrong? >>> >>> What's the Fuseki server configuration? >>> >>> What's the history of the TDB database? How long ago was it created and >>> has it been used outside of Fuseki or in different Fuseki configurations? >>> >>> JENA-1482 (which isn't in a released version of Jena) is a few >>> diagnostics >>> to catch error situations earlier. >>> >>> Andy >>> >>> On 30/03/18 07:06, Christopher Johnson wrote: >>> >>> Hi list, >>>> >>>> I apologize, but on further evaluation, I have determined that it >>>> definitely works in 3.4.0 but not in 3.5.0. So, the scope of changes >>>> (since 2017 July) is a bit broader. >>>> Also, for reference, here is a link to the client method >>>> <https://github.com/pan-dora/ldp-client/blob/master/src/main >>>> /java/cool/pandora/ldpclient/LdpClientImpl.java#L875-L879>. >>>> If you need detailed server logs, please let me know. >>>> >>>> Christopher Johnson >>>> Scientific Associate >>>> Universitätsbibliothek Leipzig >>>> >>>> On 30 March 2018 at 07:08, Christopher Johnson <[email protected]> >>>> wrote: >>>> >>>> Hi list, >>>> >>>>> >>>>> I am developing a client that writes asynchronously to Fuseki via >>>>> TrellisLDP <https://github.com/trellis-ldp/trellis>. There seems to >>>>> be >>>>> an issue since 3.6.0 that throws a BindingTBD error. >>>>> >>>>> The exception is this: >>>>> >>>>> fuseki | [2018-03-27 17:02:44] BindingTDB ERROR get1(?parent) >>>>> >>>>>> fuseki | java.nio.BufferOverflowException >>>>>> fuseki | at java.nio.HeapByteBuffer.put(He >>>>>> apByteBuffer.java:214) >>>>>> fuseki | at sun.nio.ch.IOUtil.read(IOUtil.java:200) >>>>>> >>>>>> >>>>> >>>>> I looked at the code in BindingTBD and reverted the change in >>>>> JENA-1482: >>>>> tests for null NodeIds at the TDB level" to see if a different result >>>>> would >>>>> occur. >>>>> >>>>> Now it throws this exception: >>>>> >>>>> fuseki | [2018-03-30 03:39:34] BindingTDB ERROR get1(?object) >>>>> >>>>>> fuseki | org.apache.jena.tdb.base.file.FileException: In the >>>>>> middle >>>>>> of an alloc-write >>>>>> fuseki | at org.apache.jena.tdb.base.objec >>>>>> tfile.ObjectFileStorage. >>>>>> read(ObjectFileStorage.java:311) >>>>>> >>>>>> >>>>> >>>>> This looks like a concurrency issue. I do not know why it works in >>>>> 3.5.0. Any ideas? >>>>> >>>>> Thank you for your consideration of this. >>>>> >>>>> Christopher Johnson >>>>> Scientific Associate >>>>> Universitätsbibliothek Leipzig >>>>> >>>>> >>>>> >>>> >>
