Hi Andy,
I have made a minimal test
<https://github.com/pan-dora/ldp-client/blob/async/src/test/java/cool/pandora/ldpclient/FusekiConcurrencyTest.java>
that can reproduce the issue. It depends directly on the
jdk.incubator.httpclient and is therefore quite lightweight. If you
prefer, I can put it in a separate test repository.
The way that it works is it makes two arrays of 400 iterations of
CompletableFuture<?>[], one for POST and the other for GET. The two arrays
are combined and then executed with CompletableFuture.allOf(..).join();
The GET is a basic SPARQL query ("SELECT * WHERE {?s ?p ?o}") and the POST
inserts a randomized triple ("INSERT DATA { <" + pid + "> <urn:x:p>
<urn:x:o>}")
I hope this helps. It should produce the issue for any Fuseki 3.6.0
deployment.
Thank you again for your consideration.
Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig
On 1 April 2018 at 15:09, Christopher Johnson <[email protected]> wrote:
> Hi Andy,
>
> I have been able to reproduce the problem in a standard non-docker Jetty
> 9.4 deployment of fuseki-3.6.0 Thus, I think it is safe to rule out docker
> and Tomcat as causes.
>
> To further reduce causality, it would help to write a test that could
> reproduce trellis-triplestore ResourceService
> <https://github.com/trellis-ldp/trellis/blob/master/trellis-triplestore/src/main/java/org/trellisldp/triplestore/TriplestoreResourceService.java>
> operations for a PUT request without using RDFConnection. This would
> involve potentially parallel and dynamic GET and POST operations to Fuseki.
>
> There is not much advantage of parallel operations for Fuseki+TDB - some
>> protocol parallelism - because by the time it gets to the TDB storage,
>> there is one active writer at a time.
>
>
> That seems true. The advantages of parallel operations are negligible at
> storage. I tend to think that there are simply practical benefits to
> non-blocking async IO for clients who need to batch and control large
> numbers of over-the-wire update requests. This point of view is certainly
> debatable.
>
> I CC Aaron Coburn (the TrellisLDP developer) in this email, and perhaps he
> can help with the test case.
>
> -Christopher
>
>
> On 31 March 2018 at 19:53, Andy Seaborne <[email protected]> wrote:
>
>> Christopher,
>>
>> I'm not seeing any 500 responses in the 3.6.0 logs - which is odd in
>> itself - and the update operations are SPARQL updates using POST. No PUTs.
>> Did you mean these are in the Trellis logs?
>>
>> If it is concurrency, it should be possible to simulate by shell script
>> hitting the server with parallel requests as long as the requests cause
>> change in the database.
>>
>> There is not much advantage of parallel operations for Fuseki+TDB - some
>> protocol parallelism - because by the time it gets to the TDB storage,
>> there is one active writer at a time.
>>
>> An issue of alloc-write in docker setups has been reported before. That
>> was with shared folders and the setup of the docker filesystem used.
>>
>> Underlying problems with the host filesystem are not going to get
>> reported nicely (e.g lack of space). Also - the TDB database locking will
>> not work perfectly on docker and any shared access, even if a process on
>> the host is only reading the database (that includes backup software) may
>> interfere.
>>
>> The JIRA resolved tickets for 3.5.0:
>> https://s.apache.org/jena-3.5.0-jira
>>
>> JENA-1379 is the most obvious TDB related one but if that is the cause
>> then it could be that 3.4.0 is silently going wrong.
>>
>> Andy
>>
>> On 31/03/18 12:19, Christopher Johnson wrote:
>>
>>> Hi,
>>>
>>> Thank you for the reply. It is a relatively complex test method to
>>> reproduce, unfortunately, because it requires JDK 10 with the
>>> jdk.incubator.httpclient. The client also connects to the app with
>>> HTTP/2. I suspect that any asynchronous client, however, could produce a
>>> similar result.
>>>
>>> Here are some details:
>>> 1. TrellisLDP has configuration options to write to an in memory or TDB
>>> filesystem dataset. Both of these options work with asynchronous writes,
>>> and TrellisLDP depends on jena-osgi and jena-tdb2 version 3.6.0.
>>> 2. Fuseki is deployed in a Tomcat 9 container and trellis-app connects to
>>> it over HTTPS using RDFConnection.connect("https:/
>>> /fuseki:8443/fuseki/trellis"). (I do not suspect that Tomcat or the
>>> protocol are consequential to the problem).
>>> 3. The tests always start from a clean dataset.
>>> 4. The client test does repeated PUTs of a 5k size n-triples dataset (400
>>> iterations).
>>> 5. The client test is run in JUnit5 in a JDK 10 JVM on localhost.
>>> 6. The trellis-app and Fuseki deployment is dockerized. The Fuseki
>>> dataset
>>> directory is bound to the localhost filesystem. The container
>>> environments
>>> are initialized clean for each run, which facilitates reproducible test
>>> output.
>>>
>>> I have committed server and client access logs from a 3.4.0 deployment
>>> and
>>> a 3.6.0 deployment here:
>>> https://github.com/pan-dora/ldp-client/tree/async/fuseki-issue
>>>
>>> The 3.4.0 test starts at 12:19:05.807 (Central European Summer Time).
>>> The
>>> access log records 400 PUT requests all with a 201 response in 25
>>> seconds)
>>> The 3.6.0 test starts at 12:27:18.161 (Central European Summer Time).
>>> The
>>> access log records 266 PUT requests with a 201 response and 134 with a
>>> 500
>>> response in 34 seconds)
>>>
>>> The first exception occurs 23 seconds into the test (line 12369 in
>>> fuseki-3.6.0-async.txt):
>>>
>>> [2018-03-31 10:27:41] BindingTDB ERROR get1(?object)
>>>> org.apache.jena.tdb.base.file.FileException: In the middle of an
>>>> alloc-write
>>>>
>>>
>>>
>>> Thank you again for your assistance, and I am glad to do further
>>> diagnostics. I have considered applying incremental commit patches to a
>>> 3.4.0 build to isolate the changeset(s) as the next course of action.
>>>
>>> Christopher Johnson
>>> Scientific Associate
>>> Universitätsbibliothek Leipzig
>>>
>>>
>>> On 31 March 2018 at 09:26, Andy Seaborne <[email protected]> wrote:
>>>
>>> Hi there,
>>>>
>>>> The log files would be useful to know what the server was doing at the
>>>> time and to see the stacktraces. The ideal is a complete, minimal
>>>> example
>>>> which we can use to investigate. Without a runnable setup, its a bit
>>>> harder to investigate.
>>>>
>>>> Some questions to help me understand what is happening:
>>>>
>>>> What does the the code in pan-dora/ldp-client is do? If it is concurrent
>>>> operations, what else is the client doing at the time?
>>>>
>>>> Does it work if you go back to 3.4.0 after using 3.5.0 where it went
>>>> wrong?
>>>>
>>>> What's the Fuseki server configuration?
>>>>
>>>> What's the history of the TDB database? How long ago was it created and
>>>> has it been used outside of Fuseki or in different Fuseki
>>>> configurations?
>>>>
>>>> JENA-1482 (which isn't in a released version of Jena) is a few
>>>> diagnostics
>>>> to catch error situations earlier.
>>>>
>>>> Andy
>>>>
>>>> On 30/03/18 07:06, Christopher Johnson wrote:
>>>>
>>>> Hi list,
>>>>>
>>>>> I apologize, but on further evaluation, I have determined that it
>>>>> definitely works in 3.4.0 but not in 3.5.0. So, the scope of changes
>>>>> (since 2017 July) is a bit broader.
>>>>> Also, for reference, here is a link to the client method
>>>>> <https://github.com/pan-dora/ldp-client/blob/master/src/main
>>>>> /java/cool/pandora/ldpclient/LdpClientImpl.java#L875-L879>.
>>>>> If you need detailed server logs, please let me know.
>>>>>
>>>>> Christopher Johnson
>>>>> Scientific Associate
>>>>> Universitätsbibliothek Leipzig
>>>>>
>>>>> On 30 March 2018 at 07:08, Christopher Johnson <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi list,
>>>>>
>>>>>>
>>>>>> I am developing a client that writes asynchronously to Fuseki via
>>>>>> TrellisLDP <https://github.com/trellis-ldp/trellis>. There seems to
>>>>>> be
>>>>>> an issue since 3.6.0 that throws a BindingTBD error.
>>>>>>
>>>>>> The exception is this:
>>>>>>
>>>>>> fuseki | [2018-03-27 17:02:44] BindingTDB ERROR get1(?parent)
>>>>>>
>>>>>>> fuseki | java.nio.BufferOverflowException
>>>>>>> fuseki | at java.nio.HeapByteBuffer.put(He
>>>>>>> apByteBuffer.java:214)
>>>>>>> fuseki | at sun.nio.ch.IOUtil.read(IOUtil.java:200)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I looked at the code in BindingTBD and reverted the change in
>>>>>> JENA-1482:
>>>>>> tests for null NodeIds at the TDB level" to see if a different result
>>>>>> would
>>>>>> occur.
>>>>>>
>>>>>> Now it throws this exception:
>>>>>>
>>>>>> fuseki | [2018-03-30 03:39:34] BindingTDB ERROR get1(?object)
>>>>>>
>>>>>>> fuseki | org.apache.jena.tdb.base.file.FileException: In the
>>>>>>> middle
>>>>>>> of an alloc-write
>>>>>>> fuseki | at org.apache.jena.tdb.base.objec
>>>>>>> tfile.ObjectFileStorage.
>>>>>>> read(ObjectFileStorage.java:311)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> This looks like a concurrency issue. I do not know why it works in
>>>>>> 3.5.0. Any ideas?
>>>>>>
>>>>>> Thank you for your consideration of this.
>>>>>>
>>>>>> Christopher Johnson
>>>>>> Scientific Associate
>>>>>> Universitätsbibliothek Leipzig
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>