Re: SDB Performance Issue

Rob Vesse Wed, 20 Mar 2013 17:06:16 -0700

Comments inline:


On 3/19/13 7:45 PM, "Aaron Jackson" <[email protected]> wrote:

>Thank you Rob.  So, the results of subsequent calls to next() are dependent
>on previous calls?

Sometimes.  Generally speaking, a single call to next() may necessitate many
db calls.

>Is there any way to execute "in bulk" or as a batch to
>reduce the number of db calls?

Not currently, this would be a nice enhancement if someone had the time to
work on it but SDB has little active development right now.  Remember that
all the developers are volunteers with full time jobs and generally we
work on the parts of Jena that interest us or our respective employers pay
us to work on.

Regardless, part of the problem is that many things in SPARQL cannot be
translated into db calls because of the very different expression
evaluation semantics.  Thus often the actual path of execution is to make
some SQL query, translate the result into Jena objects, calculate an
expression in Java and then feed that to the next SQL query.  This kind of
mix of Java and SQL is likely very hard to batch up in a meaningful way.
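
As a rough illustration of that execution path (not SDB code), the sketch
below uses a hypothetical FakeDb class that only counts round trips.  The
names FakeDb, selectIds, lookup and MixedEvaluation are all invented for
this example; the point is only that an expression evaluated in Java
between two queries forces one follow-up db call per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the SQL backend: it only counts round trips.
class FakeDb {
    int calls = 0;

    // First "SQL query": returns some row ids.
    List<Integer> selectIds() {
        calls++;
        return Arrays.asList(1, 2, 3);
    }

    // Follow-up "SQL query": looks up one key computed on the Java side.
    String lookup(String key) {
        calls++;
        return "value-for-" + key;
    }
}

class MixedEvaluation {
    static List<String> run(FakeDb db) {
        List<String> results = new ArrayList<>();
        for (int id : db.selectIds()) {          // one db call for the whole loop
            String key = "item-" + (id * 10);    // expression evaluated in Java
            results.add(db.lookup(key));         // one more db call per row
        }
        return results;
    }

    public static void main(String[] args) {
        FakeDb db = new FakeDb();
        List<String> results = run(db);
        System.out.println(results.size() + " results, " + db.calls + " db calls");
    }
}
```

With three rows this makes four round trips (1 + N).  Batching the lookups
would require knowing every key before the Java expression has run, which
is why this pattern is awkward to batch.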

Also, batching is likely to be very backend dependent, so it may be hard to
implement this in a way that isn't heavily tied to one specific backend.

>
>I'm trying to find a balance between loading the entire triple store into
>memory and leaving everything in the database.  We have implemented an LRU
>cache to further reduce the number of queries, but there are still cases
>where a fairly complex query will run (with good time in memory) but which
>takes an hour or longer to process against the db.

Is there a particular reason you are using SDB?

TDB is much more actively developed, more scalable and more performant.  It
uses memory-mapped files, so loading a large database can require a good
amount of RAM on your machine, but it will likely yield much better
performance for what you have described so far.

>
>On another note, the problem could be significantly eased by being able to
>load the entire store more efficiently.  Right now we have about 96 MB of
>data (maybe about 100,000 triples?).  We are loading them into a Model by
>just calling execConstruct() on a simple "select ?s ?p ?o", resulting in a
>returned Model.  It takes about 7 minutes with the current data set.  Is
>that call resulting in numerous calls to the database as well?

Yes, essentially what execConstruct() does under the hood is take the
WHERE part, call the internals of execSelect() to get an iterator over the
solutions and then pass them through the construct template to generate
the in-memory model.
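
A simplified model of that flow, using only the Java standard library
rather than Jena itself, might look like the following.  LazySolutions and
ConstructSketch are hypothetical names, and the dbCalls counter stands in
for whatever SQL the real iterator would issue; this is a sketch of the
lazy-iterator idea, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical lazy solution iterator: no work happens until next() is called.
class LazySolutions implements Iterator<Integer> {
    private final int total;
    private int position = 0;
    int dbCalls = 0;   // counts simulated round trips to the database

    LazySolutions(int total) {
        this.total = total;
    }

    @Override
    public boolean hasNext() {
        return position < total;
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        dbCalls++;     // each solution costs (at least) one simulated db call
        return position++;
    }
}

class ConstructSketch {
    // Drains the solutions and instantiates a fixed template for each one,
    // collecting the resulting triples in memory.
    static List<String> construct(Iterator<Integer> solutions) {
        List<String> model = new ArrayList<>();
        while (solutions.hasNext()) {
            int s = solutions.next();
            model.add("<s" + s + "> <p> <o" + s + "> .");
        }
        return model;
    }

    public static void main(String[] args) {
        LazySolutions solutions = new LazySolutions(3);
        System.out.println("db calls before iterating: " + solutions.dbCalls);
        List<String> model = construct(solutions);
        System.out.println(model.size() + " triples, " + solutions.dbCalls + " db calls");
    }
}
```

Creating the iterator is free in this model; the cost is paid one
solution at a time while the template is being filled in, which matches
the behaviour visible in the SQL log.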

>Do you have
>any recommendations for the fastest way to load all triples in a named
>model from the database into memory?

I am not personally that familiar with the SDB code, but you may want to
look at SDBFactory.connectNamedModel().  I don't think it loads the entire
graph into a model, though; rather, it presents a wrapper over SDB which
calls into the DB when necessary.

Rob

>
>Thanks again,
>
>Aaron
>
>On Tue, Mar 19, 2013 at 6:49 PM, Rob Vesse <[email protected]> wrote:
>
>> Hi Aaron
>>
>> Here you are encountering a common misconception among users, which we as
>> developers clearly need to do better at covering in the documentation:
>> that calling execBlah() on a query execution actually fully executes the
>> query.  In fact, what execBlah() does varies according to the exact
>> variant called.
>>
>> In any of the cases where you are receiving some form of iterator in
>> response, that iterator is essentially just a plan for how to execute
>> that query; only when you start iterating over the iterator does any
>> work get done.
>>
>>
>> Depending on the SDB layout, backend and SPARQL query used, SDB may have
>> to translate your SPARQL query into arbitrarily many SQL queries because
>> much of the work often cannot be pushed off to the database level.  It
>> sounds like this is what you are seeing in your scenario.
>>
>> Rob
>>
>> On 3/19/13 1:51 PM, "Aaron Jackson" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >I have been working on a client project on which we have used the Jena
>> >SDB implementation (1.3.5 snapshot backed by Oracle) as our triple store.
>> >
>> >The basic issue we have is as follows:
>> >
>> >During implementation we encountered some fairly severe performance
>> >restrictions during querying -- not during the actual query execution,
>> >but during the subsequent iteration over the resulting triples (we are
>> >primarily using CONSTRUCT).  It seems that the iterator is reaching out
>> >to the underlying Oracle instance on nearly every iteration, which, when
>> >we have potentially thousands of triples in the results, is extremely
>> >prohibitive.
>> >
>> >The solution we implemented was to pre-load all the triples into a
>> >"live" in-memory model up front, which gave us the performance we needed.
>> >However, we are now approaching a size where loading the entire model is
>> >no longer feasible, in terms of footprint and initial load time.  I
>> >realize the models can be broken up in many ways, allowing only partial
>> >loads, but the problem is that we have no real way of knowing what data
>> >the system might be interested in beforehand -- any historical data needs
>> >to be available for querying at any time.
>> >
>> >My question is whether anyone has encountered this issue before, how
>> >they may have handled it, and whether there is a setting we are missing
>> >or another way to handle this.  Without digging into the weeds I can't be
>> >sure, but it seems like the iterator's implementation could be optimized
>> >to significantly reduce the number of calls to the database.
>> >
>> >The actual query is irrelevant -- this happens for any construct query.
>> >
>> >Here is the basic code that is running slowly (it calls the db in most
>> >iterations of the loop).  The Iterator is returned very quickly from the
>> >QueryExecution.execConstructTriples method.
>> >
>> >protected void iterate(Iterator<com.hp.hpl.jena.graph.Triple> it)
>> >    {
>> >        while (it.hasNext())
>> >        {
>> >            com.hp.hpl.jena.graph.Triple jenaTrip = it.next();
>> >            // this call to next() results in a call to Oracle through JDBC
>> >            // add the triple to another list
>> >        }
>> >    }
>> >
>> >If SQL logging is turned on, you can easily see the large number of
>> >independent calls as the loop executes.
>> >
>> >Thanks,
>> >
>> >Aaron
>> >
>> >
>> >--
>> >Aaron Jackson
>> >Lead Solution Architect
>> >Blue Slate Solutions | Phone: 518.810.0372 | Cell: 845.392.6923
>> >Email: [email protected] | www.blueslate.net
>>
>>
>
>
>-- 
>Aaron Jackson
>Lead Solution Architect
>Blue Slate Solutions | Phone: 518.810.0372 | Cell: 845.392.6923
>Email: [email protected] | www.blueslate.net
