Ok -
I tried the DatasetAccessor approach and discovered the subtlety in the
phrase "PUT the entire updated graph to the server." What I found is that,
upon each update, the graph subset I send as the update replaces - and thus
becomes - the complete graph.
My use case is somewhat unusual: I have a set of statements that I’m wrapping
with
"INSERT DATA { " + statements + "}";
and using the UpdateProcessor from UpdateExecutionFactory to send to Fuseki.
From the logs, it looks as though the UpdateProcessor iterates through the
statements and posts them individually, leading to a potentially large
number of individual, rapid-fire updates. Is that right?
So what I attempted to do was to read those statements into a model and
post them using the DatasetAccessor. Bad idea.
What I’d like to have done is to send across a set of statements and have the
set be processed as an update. And, thinking ahead, I’m fairly certain that the
update should be treated as a transaction (across the set of statements).
Is anything like that supported?
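For concreteness, here's a sketch of the kind of thing I mean: build one INSERT DATA update from the whole set of statements (serialized as N-Triples), so the server can apply the set as a single update rather than one update per statement. The statement strings below are just placeholders.

```java
import java.util.List;

public class BatchInsert {
    // Build one SPARQL Update string from a set of N-Triples statements,
    // so the whole set travels as a single update (and so, ideally, one
    // transaction on the server side) instead of one update per statement.
    static String buildInsertData(List<String> ntriples) {
        StringBuilder sb = new StringBuilder("INSERT DATA {\n");
        for (String stmt : ntriples) {
            sb.append("  ").append(stmt).append("\n");
        }
        sb.append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder statements standing in for the real payload
        List<String> stmts = List.of(
            "<http://example.org/s> <http://example.org/p> \"o1\" .",
            "<http://example.org/s> <http://example.org/p> \"o2\" .");
        System.out.println(buildInsertData(stmts));
    }
}
```

The resulting string would then go to Fuseki exactly as before, via UpdateRequest and UpdateExecutionFactory.createRemote.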
BTW - SSD for TDB does seem to have addressed my current update rate problem -
I’ve not seen Fuseki lockups since moving to SSD. But my current code is a
prototype; I expect there’ll be an increase in update rates that will
eventually nullify this approach.
Thanks,
Mark
On Jun 3, 2014, at 5:54 AM, Andy Seaborne <[email protected]> wrote:
> On 03/06/14 09:43, Rob Vesse wrote:
>> Mark
>>
>> Answers inline:
>>
>> On 02/06/2014 19:14, "Mark Feblowitz" <[email protected]> wrote:
>>
>>> Ok -
>>>
>>> It’s another day… after a weekend’s rest. And now I have a few
>>> observations and many questions:
>>>
>>> So, a bit more about what I’m doing (much of it likely naive):
>>>
>>> I create a local ontmodel, create several statements in that model,
>>> convert the model to a string, wrap it with INSERT DATA { }, then post it
>>> to Fuseki using an Update Request:
>>>
>>> try {
>>>     UpdateRequest ur = new UpdateRequest();
>>>     ur.add(payloadContent);
>>>     UpdateProcessor up =
>>>         UpdateExecutionFactory.createRemote(ur, service);
>>>     up.execute();
>>> } catch (Exception e) {
>>>     e.printStackTrace();
>>> }
>>
>> If your graphs are relatively small then it might be more efficient to
>> look at using DatasetAccessor
>> (https://jena.apache.org/documentation/javadoc/arq/com/hp/hpl/jena/query/DatasetAccessor.html)
>> to PUT the entire updated graph to the server via the
>> Graph Store Protocol (SPARQL 1.1 Graph Store HTTP Protocol
>> <http://www.w3.org/TR/sparql11-http-rdf-update/>) e.g.
>>
>> DatasetAccessor accessor =
>> DatasetAccessorFactory.createHTTP("http://localhost:3030/ds/data");
>> accessor.putModel(model, "http://example.org/graph");
>>
>>>
>>>
>>> I have standalone code that sends SPARQL queries to the same Fuseki
>>> server, using the usual methods.
>>>
>>> So, before I go deeper into the questions, it occurs to me that I don’t
>>> necessarily need to post remotely, and removing that overhead would be an
>>> obvious improvement. I’m wondering whether my update code could interact
>>> directly with TDB while also having the content be queriable via Fuseki.
>>> I recall trying that a while ago and noting that direct posts to TDB were
>>> not available for query until restarting Fuseki. Was that a (repaired)
>>> bug or a feature?
>>
>> DO NOT do this!
>
> It is not a bug (would you edit the files for MySQL bypassing the MySQL
> engine?)
>
>> TDB is only designed for single JVM usage and attaching multiple JVMs to a
>> dataset runs the risk of irrevocably corrupting your dataset. Work is
>> actually under way to add functionality to TDB to prevent users doing this
>> because we've seen far too many cases of data corruption resulting from
>> this.
>>
>>
>>>
>>> If I can’t do this, do you know of any projects that have created a
>>> server that can be run with my application and share the TDB connection?
>>
>> You want a triple store that can run in both embedded and remote mode
>> simultaneously? No, I don't know of any.
>>
>> There are stores that support non-HTTP protocols (ODBC, custom protocol
>> buffer based etc) that may offer better performance and allow you to have
>> multiple applications talking to the server over different protocols but
>> they still require a single server. Most of these are, however,
>> commercial options, but we can provide our recommendations if you want
>> them.
>>
>>>
>>> Assuming that I do stick with remote updates, I have some questions:
>>>
>>> First, note that I create a new UpdateProcessor for each posted update.
>>> I don’t see any up.close(), so I’m guessing I’m not creating any leakage
>>> here. Would it be better to create one update processor and reuse it
>>> across updates?
>>
>> Yes - each remote update request is a self-contained HTTP request with no
>> response body expected, so the connection is closed automatically after
>> the request is made and the response headers are read.
>>
>>>
>>> Next, I’m seeing a lot of notices like the following in the Fuseki
>>> console:
>>>
>>> 13:40:07 INFO [19125] 204 No Content (21 ms)
>>> 13:40:07 INFO [19126] POST http://localhost:3030/km4sp/update
>>> 13:40:07 INFO [19126] 204 No Content (13 ms)
>>> 13:40:07 INFO [19127] POST http://localhost:3030/km4sp/update
>>> 13:40:07 INFO [19127] 204 No Content (16 ms)
>>> 13:40:07 INFO [19128] POST http://localhost:3030/km4sp/update
>>> 13:40:07 INFO [19128] 204 No Content (14 ms)
>>> 13:40:07 INFO [19129] POST http://localhost:3030/km4sp/update
>>> 13:40:07 INFO [19129] 204 No Content (8 ms)
>>> 13:40:07 INFO [19130] POST http://localhost:3030/km4sp/update
>>>
>>> I can't tell whether this reflects
>>> 1) a high overall number of update calls from my code, or
>>> 2) that the Fuseki side processes each of the triples for an update as
>>> an http post.
>
> The client API sends one SPARQL update request for each SPARQL update the
> app requests. It does not break up an INSERT DATA, nor does it combine them.
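(A sketch of combining several operations client-side: SPARQL 1.1 Update allows multiple operations in one request, separated by ';', so they travel as one HTTP POST. The operation strings below are placeholders.)

```java
import java.util.List;

public class CombinedUpdate {
    // SPARQL 1.1 Update allows several operations in a single request,
    // separated by ';'. Joining them client-side means one HTTP POST to
    // the /update endpoint instead of one POST per operation.
    static String combine(List<String> operations) {
        return String.join(" ;\n", operations);
    }

    public static void main(String[] args) {
        String combined = combine(List.of(
            "INSERT DATA { <http://example.org/s> <http://example.org/p> 1 }",
            "INSERT DATA { <http://example.org/s> <http://example.org/p> 2 }"));
        System.out.println(combined);
        // The combined string can then be handed to an UpdateRequest and
        // sent via UpdateExecutionFactory.createRemote as before.
    }
}
```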
>
>>
>> It reflects both
>>
>> A remote update request is executed by making an HTTP POST to the /update
>> endpoint seen in your logs
>>
>> 204 No Content is the standard Fuseki response to a successful update
>> since there is not expected to be any response body from an update
>>
>>>
>>> If it’s the latter, I don’t think I can batch up more updates to improve
>>> overall performance.
>>>
>>> I guess I can tell by inserting delays between my posts and capturing
>>> timestamps in my log that can be compared against what I’m seeing in the
>>> Fuseki console. I’ll try that next.
>>
>> The number in square brackets is the request ID and is included in the
>> response as a Fuseki-Request-ID header. Returned headers aren't exposed
>> by ARQ to your code per se, but you can raise the logging level for the
>> org.apache.http packages in your client application to see detailed HTTP
>> traces of the communications between your application and Fuseki. You'll
>> need DEBUG or TRACE level depending on the version of ARQ: more recent
>> versions of ARQ include more recent versions of Apache HttpClient, which
>> need the log level all the way up to TRACE, while older ARQ versions only
>> needed DEBUG to see the traces.
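(For example, with a log4j 1.x properties file the relevant lines might look like this - exact logger names and levels depend on your logging setup and ARQ version:)

```properties
# Raise Apache HttpClient logging to see detailed HTTP traces
# (TRACE for recent ARQ/HttpClient versions, DEBUG for older ones)
log4j.logger.org.apache.http=TRACE
log4j.logger.org.apache.http.wire=TRACE
```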
>
> I think if you run the server "--verbose" it'll print out yet more details as
> well.
>
> Andy
>
>>
>> Rob
>>
>>>
>>> I guess that’s it for the questions for now.
>>>
>>> Any recommendations?
>>>
>>> Thanks again,
>>>
>>> Mark
>>>
>>>
>>>
>>> On May 31, 2014, at 12:22 PM, Andy Seaborne <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The long running query is quite significant.
>>>>
>>>> On 30/05/14 18:26, Mark Feblowitz wrote:
>>>>> That’s a good idea.
>>>>>
>>>>> One improvement I’ve already made was to relocate the DB to local
>>>>> disk - having it on a shared filesystem is an even worse idea.
>>>>>
>>>>> The updates tend to be on the order of 5-20 triples at a time.
>>>>
>>>> If you could batch up changes, that will help for all sorts of reasons.
>>>> c.f. autocommit and JDBC where many small changes run really slowly.
>>>>
> This is part of the issue - write transactions have a significant fixed
> cost that you incur even for a (theoretical) transaction of no changes: it
> has to write a few bytes and do a disk sync. Reads continue during this
> time, but longer write times mean there is less chance of the system being
> able to write the journal back to the main database. JENA-567 may help: it
> isn't faster (it's slower), but it saves memory.
>>>>
>>>> Read transactions have near zero cost in TDB - Fuseki/TDB is
>>>> read-centric.
>>>>
> What's more, the TDB block size is 8 Kbytes, so one change in a block is
> 8K of transaction state - multiple times over for multiple indexes. So 5
> triples of change get very little shared-block effect, and the memory
> footprint is disproportionately large.
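(A back-of-envelope illustration of that point, assuming the worst case where each triple dirties a distinct 8K leaf block in each of TDB's three triple indexes, SPO/POS/OSP:)

```java
public class JournalFootprint {
    // Worst-case transaction state for a small update: every triple
    // dirties a distinct 8K block in each triple index, so no blocks
    // are shared and the journal grows far faster than the data itself.
    static long worstCaseBytes(int triples, int indexes, int blockBytes) {
        return (long) triples * indexes * blockBytes;
    }

    public static void main(String[] args) {
        // 5 triples x 3 indexes x 8192 bytes = 122880 bytes of
        // transaction state for perhaps a few hundred bytes of data
        System.out.println(worstCaseBytes(5, 3, 8 * 1024) + " bytes");
    }
}
```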
>>>>
>>>> <thinking out loud id=1>
>>>>
> A block size of 1 or 2K for the leaf blocks in TDB, leaving the branch
> blocks at 8K (they live in different block managers, i.e. different
> files), would be worth experimenting with.
>>>>
>>>> </thinking out loud>
>>>>
>>>> <thinking out loud id=2>
>>>>
> We could provide some netty/mina/...-based server that did the moral
> equivalent of the SPARQL Protocol (cf jena-jdbc, jena-client). HTTP is
> said to be an appreciable cost. That is no judgement of Jetty/Tomcat -
> it is the nature of HTTP - and it is cautious wording because I haven't
> observed it myself: Jetty locally, together with careful streaming of
> results, seems to be quite effective. Fast encoding of results would be
> good for both.
>>>>
>>>> </thinking out loud>
>>>>
>>>>> I believe I identified the worst culprit, and that was using
>>>>> OWLFBRuleReasoner rather than RDFSExptRuleReasoner or
>>>>> TransitiveReasoner. My guess is that the longish query chain over a
>>>>> large triplestore, using the Owl reasoner was leading to very long
>>>>> query times and lots of memory consumption. Do you think that’s a
>>>>> reasonable guess?
>>>>
>>>> That does look right. Long running queries, or the effect of an
>>>> intense stream of small back-to-back queries combined with the update
>>>> pattern, leave no time for the system to flush the journal back to the
>>>> main database. This leads to memory usage and eventually OOME.
>>>>
>>>>> How I reached that conclusion was to kill the non-responsive (even
>>>>> for a small query) Fuseki and restart with RDFSExptRuleReasoner (same
>>>>> DB, with many triples). After that, both the small query and the
>>>>> multi-join query responded quite quickly.
>>>>>
>>>>> If necessary, I’ll try to throttle the posts, since I’m in complete
>>>>> control of the submissions.
>>>>
> That should at least prove whether this discussion has correctly
> diagnosed the interactions leading to OOME. What we have is ungraceful
> ("disgraceful") behaviour as the load reaches system saturation. It
> ought to be more graceful but, fundamentally, it's always going to be
> possible to flood a system - any system - with more work than it is
> capable of handling.
>>>>
>>>> Andy
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>> On May 30, 2014, at 12:52 PM, Andy Seaborne <[email protected]> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> How big are the updates?
>>>>>>
>>>>>> An SSD for the database and the journal will help.
>>>>>>
>>>>>> Every transaction is a commit, and a commit is a disk operation to
>>>>>> ensure the commit record is permanent. That is not cheap with a
>>>>>> rotational disk (seek time), and much better with an SSD.
>>>>>>
>>>>>> If you are driving Fuseki as hard as possible, something will break
>>>>>> - the proposal in JENA-703 amounts to slowing the clients down as
>>>>>> well as being more defensive.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 30/05/14 15:39, Rob Vesse wrote:
>>>>>>> Mark
>>>>>>>
>>>>>>> This sounds like the same problem described in
>>>>>>> https://issues.apache.org/jira/browse/JENA-689
>>>>>>>
>>>>>>> TL;DR
>>>>>>>
>>>>>>> For a system with no quiescent periods continually receiving
>>>>>>> updates the in-memory journal continues to expand until such time
>>>>>>> as an OOM occurs. There will be little/no data loss because the
>>>>>>> journal is a write ahead log and is first written to disk (you
>>>>>>> will lose at most the data from the transaction that encountered
>>>>>>> the OOM). Therefore once the system is restarted the journal
>>>>>>> will be replayed and flushed.
>>>>>>>
>>>>>>> See https://issues.apache.org/jira/browse/JENA-567 for an
>>>>>>> experimental feature that may mitigate this and see
>>>>>>> https://issues.apache.org/jira/browse/JENA-703 for the issue
>>>>>>> tracking the work to remove this limitation
>>>>>>>
>>>>>>> Rob
>>>>