Christopher,

I'm not seeing any 500 responses in the 3.6.0 logs - which is odd in itself - and the update operations are SPARQL updates using POST. No PUTs. Did you mean these are in the Trellis logs?

If it is concurrency, it should be possible to simulate by shell script hitting the server with parallel requests as long as the requests cause change in the database.

There is not much advantage of parallel operations for Fuseki+TDB - some protocol parallelism - because by the time it gets to the TDB storage, there is one active writer at a time.

An issue of alloc-write in docker setups has been reported before. That was with shared folders and the setup of the docker filesystem used.

Underlying problems with the host filesystem are not going to get reported nicely (e.g lack of space). Also - the TDB database locking will not work perfectly on docker and any shared access, even if a process on the host is only reading the database (that includes backup software) may interfere.

The JIRA resolved tickets for 3.5.0:
https://s.apache.org/jena-3.5.0-jira

JENA-1379 is the most obvious TDB related one but if that is the cause then it could be that 3.4.0 is silently going wrong.

    Andy

On 31/03/18 12:19, Christopher Johnson wrote:
Hi,

Thank you for the reply.  It is a relatively complex test method to
reproduce, unfortunately, because it requires JDK 10 with the
jdk.incubator.httpclient.  The client also connects to the app with
HTTP/2.  I suspect that any asynchronous client, however, could produce a
similar result.

Here are some details:
1. TrellisLDP has configuration options to write to an in memory or TDB
filesystem dataset.  Both of these options work with asynchronous writes,
and TrellisLDP depends on jena-osgi and jena-tdb2 version 3.6.0.
2. Fuseki is deployed in a Tomcat 9 container and trellis-app connects to
it over HTTPS using RDFConnection.connect("https:/
/fuseki:8443/fuseki/trellis"). (I do not suspect that Tomcat or the
protocol are consequential to the problem).
3. The tests always start from a clean dataset.
4. The client test does repeated PUTs of a 5k size n-triples dataset (400
iterations).
5. The client test is run in JUnit5 in a JDK 10 JVM on localhost.
6. The trellis-app and Fuseki deployment is dockerized.  The Fuseki dataset
directory is bound to the localhost filesystem.  The container environments
are initialized clean for each run, which facilitates reproducible test
output.

I have committed server and client access logs from a 3.4.0 deployment and
a 3.6.0 deployment here:
https://github.com/pan-dora/ldp-client/tree/async/fuseki-issue

The 3.4.0 test starts at 12:19:05.807 (Central European Summer Time).  The
access log records 400 PUT requests all with a 201 response in 25 seconds)
The 3.6.0 test starts at 12:27:18.161 (Central European Summer Time).  The
access log records 266 PUT requests with a 201 response and 134 with a 500
response in 34 seconds)

The first exception occurs 23 seconds into the test (line 12369 in
fuseki-3.6.0-async.txt):

[2018-03-31 10:27:41] BindingTDB ERROR get1(?object)
org.apache.jena.tdb.base.file.FileException: In the middle of an
alloc-write


Thank you again for your assistance, and I am glad to do further
diagnostics.  I have considered applying incremental commit patches to a
3.4.0 build to isolate the changeset(s) as the next course of action.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig


On 31 March 2018 at 09:26, Andy Seaborne <[email protected]> wrote:

Hi there,

The log files would be useful to know what the server was doing at the
time and to see the stacktraces.  The ideal is a complete, minimal example
which we can use to investigate.  Without a runnable setup, its a bit
harder to investigate.

Some questions to help me understand what is happening:

What does the the code in pan-dora/ldp-client is do? If it is concurrent
operations, what else is the client doing at the time?

Does it work if you go back to 3.4.0 after using 3.5.0 where it went wrong?

What's the Fuseki server configuration?

What's the history of the TDB database? How long ago was it created and
has it been used outside of Fuseki or in different Fuseki configurations?

JENA-1482 (which isn't in a released version of Jena) is a few diagnostics
to catch error situations earlier.

     Andy

On 30/03/18 07:06, Christopher Johnson wrote:

Hi list,

I apologize, but on further evaluation, I have determined that it
definitely works in 3.4.0 but not in 3.5.0.  So, the scope of changes
(since 2017 July) is a bit broader.
Also, for reference, here is a link to the client method
<https://github.com/pan-dora/ldp-client/blob/master/src/main
/java/cool/pandora/ldpclient/LdpClientImpl.java#L875-L879>.
If you need detailed server logs, please let me know.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig

On 30 March 2018 at 07:08, Christopher Johnson <[email protected]>
wrote:

Hi list,

I am developing a client that writes asynchronously to Fuseki via
TrellisLDP <https://github.com/trellis-ldp/trellis>.  There seems to be
an issue since 3.6.0 that throws a BindingTBD error.

The exception is this:

fuseki     | [2018-03-27 17:02:44] BindingTDB ERROR get1(?parent)
fuseki     | java.nio.BufferOverflowException
fuseki     |    at java.nio.HeapByteBuffer.put(HeapByteBuffer.java:214)
fuseki     |    at sun.nio.ch.IOUtil.read(IOUtil.java:200)



I looked at the code in BindingTBD and reverted the change in JENA-1482:
tests for null NodeIds at the TDB level" to see if a different result
would
occur.

Now it throws this exception:

fuseki     | [2018-03-30 03:39:34] BindingTDB ERROR get1(?object)
fuseki     | org.apache.jena.tdb.base.file.FileException: In the middle
of an alloc-write
fuseki     |    at org.apache.jena.tdb.base.objec
tfile.ObjectFileStorage.
read(ObjectFileStorage.java:311)



This looks like a concurrency issue.  I do not know why it works in
3.5.0.  Any ideas?

Thank you for your consideration of this.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig




Reply via email to