Hi Andy,

I have been able to reproduce the problem in a standard non-docker Jetty
9.4 deployment of fuseki-3.6.0  Thus, I think it is safe to rule out docker
and Tomcat as causes.

To further reduce causality, it would help to write a test that could
reproduce trellis-triplestore ResourceService
<https://github.com/trellis-ldp/trellis/blob/master/trellis-triplestore/src/main/java/org/trellisldp/triplestore/TriplestoreResourceService.java>
operations for a PUT request without using RDFConnection.  This would
involve potentially parallel and dynamic GET and POST operations to Fuseki.

There is not much advantage of parallel operations for Fuseki+TDB - some
> protocol parallelism - because by the time it gets to the TDB storage,
> there is one active writer at a time.


That seems true.  The advantages of parallel operations are negligible at
storage.  I tend to think that there are simply practical benefits to
non-blocking async IO for clients who need to batch and control large
numbers of over-the-wire update requests.  This point of view is certainly
debatable.

I CC Aaron Coburn (the TrellisLDP developer) in this email, and perhaps he
can help with the test case.

-Christopher


On 31 March 2018 at 19:53, Andy Seaborne <[email protected]> wrote:

> Christopher,
>
> I'm not seeing any 500 responses in the 3.6.0 logs - which is odd in
> itself - and the update operations are SPARQL updates using POST. No PUTs.
> Did you mean these are in the Trellis logs?
>
> If it is concurrency, it should be possible to simulate by shell script
> hitting the server with parallel requests as long as the requests cause
> change in the database.
>
> There is not much advantage of parallel operations for Fuseki+TDB - some
> protocol parallelism - because by the time it gets to the TDB storage,
> there is one active writer at a time.
>
> An issue of alloc-write in docker setups has been reported before. That
> was with shared folders and the setup of the docker filesystem used.
>
> Underlying problems with the host filesystem are not going to get reported
> nicely (e.g lack of space).  Also - the TDB database locking will not work
> perfectly on docker and any shared access, even if a process on the host is
> only reading the database (that includes backup software) may interfere.
>
> The JIRA resolved tickets for 3.5.0:
> https://s.apache.org/jena-3.5.0-jira
>
> JENA-1379 is the most obvious TDB related one but if that is the cause
> then it could be that 3.4.0 is silently going wrong.
>
>     Andy
>
> On 31/03/18 12:19, Christopher Johnson wrote:
>
>> Hi,
>>
>> Thank you for the reply.  It is a relatively complex test method to
>> reproduce, unfortunately, because it requires JDK 10 with the
>> jdk.incubator.httpclient.  The client also connects to the app with
>> HTTP/2.  I suspect that any asynchronous client, however, could produce a
>> similar result.
>>
>> Here are some details:
>> 1. TrellisLDP has configuration options to write to an in memory or TDB
>> filesystem dataset.  Both of these options work with asynchronous writes,
>> and TrellisLDP depends on jena-osgi and jena-tdb2 version 3.6.0.
>> 2. Fuseki is deployed in a Tomcat 9 container and trellis-app connects to
>> it over HTTPS using RDFConnection.connect("https:/
>> /fuseki:8443/fuseki/trellis"). (I do not suspect that Tomcat or the
>> protocol are consequential to the problem).
>> 3. The tests always start from a clean dataset.
>> 4. The client test does repeated PUTs of a 5k size n-triples dataset (400
>> iterations).
>> 5. The client test is run in JUnit5 in a JDK 10 JVM on localhost.
>> 6. The trellis-app and Fuseki deployment is dockerized.  The Fuseki
>> dataset
>> directory is bound to the localhost filesystem.  The container
>> environments
>> are initialized clean for each run, which facilitates reproducible test
>> output.
>>
>> I have committed server and client access logs from a 3.4.0 deployment and
>> a 3.6.0 deployment here:
>> https://github.com/pan-dora/ldp-client/tree/async/fuseki-issue
>>
>> The 3.4.0 test starts at 12:19:05.807 (Central European Summer Time).  The
>> access log records 400 PUT requests all with a 201 response in 25 seconds)
>> The 3.6.0 test starts at 12:27:18.161 (Central European Summer Time).  The
>> access log records 266 PUT requests with a 201 response and 134 with a 500
>> response in 34 seconds)
>>
>> The first exception occurs 23 seconds into the test (line 12369 in
>> fuseki-3.6.0-async.txt):
>>
>> [2018-03-31 10:27:41] BindingTDB ERROR get1(?object)
>>> org.apache.jena.tdb.base.file.FileException: In the middle of an
>>> alloc-write
>>>
>>
>>
>> Thank you again for your assistance, and I am glad to do further
>> diagnostics.  I have considered applying incremental commit patches to a
>> 3.4.0 build to isolate the changeset(s) as the next course of action.
>>
>> Christopher Johnson
>> Scientific Associate
>> Universitätsbibliothek Leipzig
>>
>>
>> On 31 March 2018 at 09:26, Andy Seaborne <[email protected]> wrote:
>>
>> Hi there,
>>>
>>> The log files would be useful to know what the server was doing at the
>>> time and to see the stacktraces.  The ideal is a complete, minimal
>>> example
>>> which we can use to investigate.  Without a runnable setup, its a bit
>>> harder to investigate.
>>>
>>> Some questions to help me understand what is happening:
>>>
>>> What does the the code in pan-dora/ldp-client is do? If it is concurrent
>>> operations, what else is the client doing at the time?
>>>
>>> Does it work if you go back to 3.4.0 after using 3.5.0 where it went
>>> wrong?
>>>
>>> What's the Fuseki server configuration?
>>>
>>> What's the history of the TDB database? How long ago was it created and
>>> has it been used outside of Fuseki or in different Fuseki configurations?
>>>
>>> JENA-1482 (which isn't in a released version of Jena) is a few
>>> diagnostics
>>> to catch error situations earlier.
>>>
>>>      Andy
>>>
>>> On 30/03/18 07:06, Christopher Johnson wrote:
>>>
>>> Hi list,
>>>>
>>>> I apologize, but on further evaluation, I have determined that it
>>>> definitely works in 3.4.0 but not in 3.5.0.  So, the scope of changes
>>>> (since 2017 July) is a bit broader.
>>>> Also, for reference, here is a link to the client method
>>>> <https://github.com/pan-dora/ldp-client/blob/master/src/main
>>>> /java/cool/pandora/ldpclient/LdpClientImpl.java#L875-L879>.
>>>> If you need detailed server logs, please let me know.
>>>>
>>>> Christopher Johnson
>>>> Scientific Associate
>>>> Universitätsbibliothek Leipzig
>>>>
>>>> On 30 March 2018 at 07:08, Christopher Johnson <[email protected]>
>>>> wrote:
>>>>
>>>> Hi list,
>>>>
>>>>>
>>>>> I am developing a client that writes asynchronously to Fuseki via
>>>>> TrellisLDP <https://github.com/trellis-ldp/trellis>.  There seems to
>>>>> be
>>>>> an issue since 3.6.0 that throws a BindingTBD error.
>>>>>
>>>>> The exception is this:
>>>>>
>>>>> fuseki     | [2018-03-27 17:02:44] BindingTDB ERROR get1(?parent)
>>>>>
>>>>>> fuseki     | java.nio.BufferOverflowException
>>>>>> fuseki     |    at java.nio.HeapByteBuffer.put(He
>>>>>> apByteBuffer.java:214)
>>>>>> fuseki     |    at sun.nio.ch.IOUtil.read(IOUtil.java:200)
>>>>>>
>>>>>>
>>>>>
>>>>> I looked at the code in BindingTBD and reverted the change in
>>>>> JENA-1482:
>>>>> tests for null NodeIds at the TDB level" to see if a different result
>>>>> would
>>>>> occur.
>>>>>
>>>>> Now it throws this exception:
>>>>>
>>>>> fuseki     | [2018-03-30 03:39:34] BindingTDB ERROR get1(?object)
>>>>>
>>>>>> fuseki     | org.apache.jena.tdb.base.file.FileException: In the
>>>>>> middle
>>>>>> of an alloc-write
>>>>>> fuseki     |    at org.apache.jena.tdb.base.objec
>>>>>> tfile.ObjectFileStorage.
>>>>>> read(ObjectFileStorage.java:311)
>>>>>>
>>>>>>
>>>>>
>>>>> This looks like a concurrency issue.  I do not know why it works in
>>>>> 3.5.0.  Any ideas?
>>>>>
>>>>> Thank you for your consideration of this.
>>>>>
>>>>> Christopher Johnson
>>>>> Scientific Associate
>>>>> Universitätsbibliothek Leipzig
>>>>>
>>>>>
>>>>>
>>>>
>>

Reply via email to