Re: SDB Performance Issue

Andy Seaborne Thu, 21 Mar 2013 09:00:54 -0700

> The actual query is irrelevant -- this happens for any construct query.

The query does affect the number of times SDB needs to go to the SQL database.


Do you have an example we can look at?

If it is a single basic graph pattern it should go once to the DB; other query structures have to go several, possibly many, times.

> Do you have any recommendations for the fastest way to load all triples
> in a named model from the database into memory?

What transaction control have you got round the code? It may be faster to start a JDBC-level transaction and then perform the update.


Adding with model.add(otherModel) should trigger the bulk loading path.

        Andy


On 21/03/13 00:05, Rob Vesse wrote:

Comments inline:


On 3/19/13 7:45 PM, "Aaron Jackson" <[email protected]> wrote:

Thank you Rob.  So, the results of subsequent calls to next() are
dependent on previous calls?


Sometimes. Generally speaking, a single call to next() may necessitate many
db calls.

Is there any way to execute "in bulk" or as a batch to
reduce the number of db calls?


Not currently, this would be a nice enhancement if someone had the time to
work on it but SDB has little active development right now.  Remember that
all the developers are volunteers with full time jobs and generally we
work on the parts of Jena that interest us or our respective employers pay
us to work on.

Regardless, part of the problem is that many things in SPARQL cannot be
translated into db calls because of the very different expression
evaluation semantics.  Thus often the actual path of execution is to make
some SQL query, translate the result into Jena objects, calculate an
expression in Java and then feed that to the next SQL query.  This kind of
mix of Java and SQL is likely very hard to batch up in a meaningful way.

Also batching is likely to be very backend dependent so it may be hard to
implement this in a way that isn't heavily tied to one specific backend.
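
As a rough illustration of the SQL, then Java, then SQL interleaving described above, here is a minimal, self-contained sketch. It does not use SDB or JDBC at all: FakeDb, passesFilter and evaluate are hypothetical names invented for this example, and the "database" is just an in-memory list that counts simulated round trips.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InterleavedEvalDemo {

    /** Hypothetical in-memory stand-in for the SQL backend; counts round trips. */
    static final class FakeDb {
        private final List<String> subjects;
        private final Map<String, String> labels;
        private int roundTrips = 0;

        FakeDb(List<String> subjects, Map<String, String> labels) {
            this.subjects = subjects;
            this.labels = labels;
        }

        /** Simulated first SQL query: one round trip returning all subjects. */
        List<String> selectSubjects() {
            roundTrips++;
            return new ArrayList<>(subjects);
        }

        /** Simulated follow-up SQL query: one round trip per subject. */
        String selectLabel(String subject) {
            roundTrips++;
            return labels.get(subject);
        }

        int roundTrips() {
            return roundTrips;
        }
    }

    /** Stands in for an expression that has to be evaluated in Java, not in SQL. */
    static boolean passesFilter(String subject) {
        return subject.startsWith("ex:a");
    }

    /** SQL step, Java step, then another SQL step for every surviving row. */
    static List<String> evaluate(FakeDb db) {
        List<String> out = new ArrayList<>();
        for (String s : db.selectSubjects()) {          // SQL step 1
            if (!passesFilter(s)) {                     // Java step
                continue;
            }
            out.add(s + " -> " + db.selectLabel(s));    // SQL step 2, per row
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new HashMap<>();
        labels.put("ex:a1", "first");
        labels.put("ex:a2", "second");
        labels.put("ex:b1", "other");
        FakeDb db = new FakeDb(Arrays.asList("ex:a1", "ex:b1", "ex:a2"), labels);
        System.out.println(evaluate(db));
        System.out.println("round trips: " + db.roundTrips());
    }
}
```

In this toy version the number of round trips is one plus the number of rows that survive the Java-side filter, which is why it grows with the size of the result.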


I'm trying to find a balance between loading the entire triple store into
memory and leaving everything in the database.  We have implemented an LRU
cache to further reduce the number of queries, but there are still cases
where a fairly complex query will run (with good time in memory) but which
takes an hour or longer to process against the db.


Is there a particular reason you are using SDB?

TDB is much more actively developed, more scalable and performant.  It
uses memory mapped files so loading a large database can require a good
amount of RAM on your machine but will likely yield much better
performance for what you have described so far.


On another note, the problem could be significantly eased by being able to
load the entire store more efficiently.  Right now we have about 96 MB of
data (maybe about 100,000 triples?).  We are loading them into a Model by
just calling execConstruct() on a simple "select ?s ?p ?o", resulting in a
returned Model.  It takes about 7 minutes with the current data set.  Is
that call resulting in numerous calls to the database as well?


Yes, essentially what execConstruct() does under the hood is take the
WHERE part, call the internals of execSelect() to get an iterator over the
solutions and then pass them through the construct template to generate
the in-memory model.
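
To make that description concrete, here is a small sketch of the "solutions through a template" step. It is not Jena code: terms are plain strings, a solution is a Map from variable name to value, and ConstructTemplateDemo, instantiate and construct are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ConstructTemplateDemo {

    /** Replace variables (terms starting with '?') in one template triple. */
    static String[] instantiate(String[] pattern, Map<String, String> solution) {
        String[] triple = new String[3];
        for (int i = 0; i < 3; i++) {
            String term = pattern[i];
            triple[i] = term.startsWith("?") ? solution.get(term) : term;
            if (triple[i] == null) {
                return null; // unbound variable: this triple is skipped
            }
        }
        return triple;
    }

    /** Pull solutions one at a time and add the instantiated triples to a set. */
    static Set<List<String>> construct(Iterator<Map<String, String>> solutions,
                                       List<String[]> template) {
        Set<List<String>> model = new LinkedHashSet<>(); // a set, so duplicates merge
        while (solutions.hasNext()) {
            Map<String, String> solution = solutions.next(); // backend work happens here
            for (String[] pattern : template) {
                String[] triple = instantiate(pattern, solution);
                if (triple != null) {
                    model.add(Arrays.asList(triple));
                }
            }
        }
        return model;
    }

    public static void main(String[] args) {
        Map<String, String> s1 = new HashMap<>();
        s1.put("?s", "ex:alice");
        s1.put("?o", "ex:bob");
        List<Map<String, String>> solutions = new ArrayList<>();
        solutions.add(s1);

        List<String[]> template = new ArrayList<>();
        template.add(new String[] {"?s", "ex:knows", "?o"});

        System.out.println(construct(solutions.iterator(), template));
    }
}
```

The point of the sketch is that building the model still drives the solution iterator to the end, so whatever each next() costs against the database is paid once per solution.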

Do you have
any recommendations for the fastest way to load all triples in a named
model from the database into memory?


I am not personally that familiar with the SDB code but you may want to
look at SDBFactory.connectNamedModel(), though I don't think that loads the
entire graph into a model; rather, it presents a wrapper over SDB which
calls into the DB when necessary.

Rob


Thanks again,

Aaron

On Tue, Mar 19, 2013 at 6:49 PM, Rob Vesse <[email protected]> wrote:

Hi Aaron

Here you are encountering a common misconception among users, one which we
as developers clearly need to cover better in the documentation: that
calling execBlah() on a query execution actually fully executes the query.
In fact, what execBlah() does varies according to the exact variant called.

In any of the cases where you are receiving some form of iterator in
response, that iterator is essentially just a plan for how to execute that
query; only when you start iterating over the iterator does any work get
done.
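
The "iterator as a plan" behaviour can be shown with a toy example. This is only a sketch and not how SDB is implemented: CountingBackend and execLazily are hypothetical names, and the backend is an in-memory list that counts how often it is asked for a row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LazyIteratorDemo {

    /** Hypothetical stand-in for a database; counts simulated round trips. */
    static final class CountingBackend {
        private final List<String> rows;
        private int calls = 0;

        CountingBackend(List<String> rows) {
            this.rows = new ArrayList<>(rows);
        }

        /** One simulated round trip per call; returns null past the last row. */
        String fetchRow(int index) {
            calls++;
            return index < rows.size() ? rows.get(index) : null;
        }

        int calls() {
            return calls;
        }
    }

    /** Building the iterator does no backend work; iterating it does. */
    static Iterator<String> execLazily(final CountingBackend backend) {
        return new Iterator<String>() {
            private int position = 0;
            private String lookahead; // row fetched by hasNext() but not yet returned

            @Override
            public boolean hasNext() {
                if (lookahead == null) {
                    lookahead = backend.fetchRow(position);
                }
                return lookahead != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String row = lookahead;
                lookahead = null;
                position++;
                return row;
            }
        };
    }

    public static void main(String[] args) {
        CountingBackend backend = new CountingBackend(Arrays.asList("t1", "t2", "t3"));
        Iterator<String> it = execLazily(backend);
        System.out.println("calls after exec: " + backend.calls());
        while (it.hasNext()) {
            it.next();
        }
        System.out.println("calls after iteration: " + backend.calls());
    }
}
```

Here the call count is zero right after execLazily() returns, and it only rises as the loop runs: one fetch per row plus one final fetch that discovers the end of the data.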


Depending on the SDB layout, backend and SPARQL query used, SDB may have to
translate your SPARQL query into arbitrarily many SQL queries because much
of the work often cannot be pushed off to the database level.  It sounds
like this is what you are seeing in your scenario.

Rob

On 3/19/13 1:51 PM, "Aaron Jackson" <[email protected]> wrote:

Hi,

I have been working on a client project on which we have used the Jena SDB
implementation (1.3.5 snapshot backed by Oracle) as our triple store.

The basic issue we have is as follows:

During implementation we encountered some fairly severe performance
restrictions during querying -- not during the actual query execution, but
during the subsequent iteration over the resulting triples (we are
primarily using CONSTRUCT).  It seems that the iterator is reaching out to
the underlying Oracle instance on nearly every iteration, which, when we
have potentially thousands of triples in the results, is extremely
prohibitive.

The solution we implemented was to pre-load all the triples into a "live"
in-memory model up front, which gave us the performance we needed.
However, we are now approaching a size where loading the entire model is no
longer feasible, in terms of footprint and initial load time.  I realize
the models can be broken up in many ways, allowing only partial loads, but
the problem is that we have no real way of knowing what data the system
might be interested in beforehand -- any historical data needs to be
available for querying at any time.

My question is whether anyone has encountered this issue before, how they
may have handled it, and whether there is a setting we are missing or
another way to handle this.  Without digging into the weeds I can't be
sure, but it seems like the iterator's implementation could be optimized to
significantly reduce the number of calls to the database.

The actual query is irrelevant -- this happens for any construct query.

Here is the basic code that is running slowly (it calls the db in most
iterations of the loop).  The Iterator is returned very quickly from the
QueryExecution.execConstructTriples method.

    protected void iterate(Iterator<com.hp.hpl.jena.graph.Triple> it)
    {
        while (it.hasNext())
        {
            com.hp.hpl.jena.graph.Triple jenaTrip = it.next();
            //this call to next is resulting in a call to Oracle through jdbc
            //add the triple to another list
        }
    }

If SQL logging is turned on, you can easily see the large number of
independent calls as the loop executes.

Thanks,

Aaron


--
Aaron Jackson
Lead Solution Architect
Blue Slate Solutions | Phone: 518.810.0372 | Cell: 845.392.6923
Email: [email protected] | www.blueslate.net



