Written up on JENA-1516

https://issues.apache.org/jira/browse/JENA-1516

    Andy

On 03/04/18 12:01, Andy Seaborne wrote:
Christopher,

I have managed to recreate something.  I am in the process of turning it into a debuggable example and then I can create a JIRA ticket.

Some of the alloc-write errors are different - a flag isn't volatile and it should be.  That isn't the main thing happening, it is the only error I am getting for 3.5.0 though on the few runs I have had time to do.

     Andy

On 01/04/18 22:25, Christopher Johnson wrote:
​​
Hi Andy,

I have made a minimal test
<https://github.com/pan-dora/ldp-client/blob/async/src/test/java/cool/pandora/ldpclient/FusekiConcurrencyTest.java>
that can reproduce the issue.  It depends directly on the
jdk.incubator.httpclient and is therefore quite lightweight.  If you
prefer, I can put it in a separate test repository.

The way that it works is it makes two arrays of 400 iterations of
CompletableFuture<?>[], one for POST and the other for GET.  The two arrays
are combined and then executed with CompletableFuture.allOf(..).join();
The GET is a basic SPARQL query ("SELECT * WHERE {?s ?p ?o}") and the POST
inserts a randomized triple ("INSERT DATA { <" + pid + "> <urn:x:p>
<urn:x:o>}")

I hope this helps.  It should produce the issue for any Fuseki 3.6.0
deployment.

Thank you again for your consideration.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig


On 1 April 2018 at 15:09, Christopher Johnson <[email protected]> wrote:

Hi Andy,

I have been able to reproduce the problem in a standard non-docker Jetty
9.4 deployment of fuseki-3.6.0  Thus, I think it is safe to rule out docker
and Tomcat as causes.

To further reduce causality, it would help to write a test that could
reproduce trellis-triplestore ResourceService
<https://github.com/trellis-ldp/trellis/blob/master/trellis-triplestore/src/main/java/org/trellisldp/triplestore/TriplestoreResourceService.java>
operations for a PUT request without using RDFConnection.  This would
involve potentially parallel and dynamic GET and POST operations to Fuseki.

There is not much advantage of parallel operations for Fuseki+TDB - some
protocol parallelism - because by the time it gets to the TDB storage,
there is one active writer at a time.


That seems true.  The advantages of parallel operations are negligible at
storage.  I tend to think that there are simply practical benefits to
non-blocking async IO for clients who need to batch and control large
numbers of over-the-wire update requests.  This point of view is certainly
debatable.

I CC Aaron Coburn (the TrellisLDP developer) in this email, and perhaps he
can help with the test case.

-Christopher


On 31 March 2018 at 19:53, Andy Seaborne <[email protected]> wrote:

Christopher,

I'm not seeing any 500 responses in the 3.6.0 logs - which is odd in
itself - and the update operations are SPARQL updates using POST. No PUTs.
Did you mean these are in the Trellis logs?

If it is concurrency, it should be possible to simulate by shell script
hitting the server with parallel requests as long as the requests cause
change in the database.

There is not much advantage of parallel operations for Fuseki+TDB - some
protocol parallelism - because by the time it gets to the TDB storage,
there is one active writer at a time.

An issue of alloc-write in docker setups has been reported before. That
was with shared folders and the setup of the docker filesystem used.

Underlying problems with the host filesystem are not going to get
reported nicely (e.g lack of space).  Also - the TDB database locking will not work perfectly on docker and any shared access, even if a process on the host is only reading the database (that includes backup software) may
interfere.

The JIRA resolved tickets for 3.5.0:
https://s.apache.org/jena-3.5.0-jira

JENA-1379 is the most obvious TDB related one but if that is the cause
then it could be that 3.4.0 is silently going wrong.

     Andy

On 31/03/18 12:19, Christopher Johnson wrote:

Hi,

Thank you for the reply.  It is a relatively complex test method to
reproduce, unfortunately, because it requires JDK 10 with the
jdk.incubator.httpclient.  The client also connects to the app with
HTTP/2.  I suspect that any asynchronous client, however, could produce a
similar result.

Here are some details:
1. TrellisLDP has configuration options to write to an in memory or TDB filesystem dataset.  Both of these options work with asynchronous writes,
and TrellisLDP depends on jena-osgi and jena-tdb2 version 3.6.0.
2. Fuseki is deployed in a Tomcat 9 container and trellis-app connects to
it over HTTPS using RDFConnection.connect("https:/
/fuseki:8443/fuseki/trellis"). (I do not suspect that Tomcat or the
protocol are consequential to the problem).
3. The tests always start from a clean dataset.
4. The client test does repeated PUTs of a 5k size n-triples dataset (400
iterations).
5. The client test is run in JUnit5 in a JDK 10 JVM on localhost.
6. The trellis-app and Fuseki deployment is dockerized.  The Fuseki
dataset
directory is bound to the localhost filesystem.  The container
environments
are initialized clean for each run, which facilitates reproducible test
output.

I have committed server and client access logs from a 3.4.0 deployment
and
a 3.6.0 deployment here:
https://github.com/pan-dora/ldp-client/tree/async/fuseki-issue

The 3.4.0 test starts at 12:19:05.807 (Central European Summer Time).
The
access log records 400 PUT requests all with a 201 response in 25
seconds)
The 3.6.0 test starts at 12:27:18.161 (Central European Summer Time).
The
access log records 266 PUT requests with a 201 response and 134 with a
500
response in 34 seconds)

The first exception occurs 23 seconds into the test (line 12369 in
fuseki-3.6.0-async.txt):

[2018-03-31 10:27:41] BindingTDB ERROR get1(?object)
org.apache.jena.tdb.base.file.FileException: In the middle of an
alloc-write



Thank you again for your assistance, and I am glad to do further
diagnostics.  I have considered applying incremental commit patches to a
3.4.0 build to isolate the changeset(s) as the next course of action.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig


On 31 March 2018 at 09:26, Andy Seaborne <[email protected]> wrote:

Hi there,

The log files would be useful to know what the server was doing at the
time and to see the stacktraces.  The ideal is a complete, minimal
example
which we can use to investigate.  Without a runnable setup, its a bit
harder to investigate.

Some questions to help me understand what is happening:

What does the the code in pan-dora/ldp-client is do? If it is concurrent
operations, what else is the client doing at the time?

Does it work if you go back to 3.4.0 after using 3.5.0 where it went
wrong?

What's the Fuseki server configuration?

What's the history of the TDB database? How long ago was it created and
has it been used outside of Fuseki or in different Fuseki
configurations?

JENA-1482 (which isn't in a released version of Jena) is a few
diagnostics
to catch error situations earlier.

      Andy

On 30/03/18 07:06, Christopher Johnson wrote:

Hi list,

I apologize, but on further evaluation, I have determined that it
definitely works in 3.4.0 but not in 3.5.0.  So, the scope of changes
(since 2017 July) is a bit broader.
Also, for reference, here is a link to the client method
<https://github.com/pan-dora/ldp-client/blob/master/src/main
/java/cool/pandora/ldpclient/LdpClientImpl.java#L875-L879>.
If you need detailed server logs, please let me know.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig

On 30 March 2018 at 07:08, Christopher Johnson <[email protected]>
wrote:

Hi list,


I am developing a client that writes asynchronously to Fuseki via
TrellisLDP <https://github.com/trellis-ldp/trellis>.  There seems to
be
an issue since 3.6.0 that throws a BindingTBD error.

The exception is this:

fuseki     | [2018-03-27 17:02:44] BindingTDB ERROR get1(?parent)

fuseki     | java.nio.BufferOverflowException
fuseki     |    at java.nio.HeapByteBuffer.put(He
apByteBuffer.java:214)
fuseki     |    at sun.nio.ch.IOUtil.read(IOUtil.java:200)



I looked at the code in BindingTBD and reverted the change in
JENA-1482:
tests for null NodeIds at the TDB level" to see if a different result
would
occur.

Now it throws this exception:

fuseki     | [2018-03-30 03:39:34] BindingTDB ERROR get1(?object)

fuseki     | org.apache.jena.tdb.base.file.FileException: In the
middle
of an alloc-write
fuseki     |    at org.apache.jena.tdb.base.objec
tfile.ObjectFileStorage.
read(ObjectFileStorage.java:311)



This looks like a concurrency issue.  I do not know why it works in
3.5.0.  Any ideas?

Thank you for your consideration of this.

Christopher Johnson
Scientific Associate
Universitätsbibliothek Leipzig







Reply via email to