Re: Batching federated calls using VALUES block

Andy Seaborne Sat, 08 Jun 2019 10:50:38 -0700



On 07/06/2019 19:18, Dave Griffith wrote:

After a bit of work, I have what appears to be a working version of Jena
with batching SERVICE calls. It's sort of complex, so I'll be adding more
tests and documentation before submitting a pull request to y'all.  Are
there any contributor docs I should read, particularly around coding
standards, configurability, or the level of testing expected?  I'd hate to
get the etiquette wrong here.


https://github.com/apache/jena/blob/master/CONTRIBUTING.md

Style is more of a "we prefer" - the most important thing is the contribution!


If it is modifying existing code, follow the style of the class.
The codebase has a long history - different styles in different places.

Around level of testing in particular, this is
as I say a pretty complex feature and deserves to be fully tested, but I'd
hate to slow down your (pretty darn fast) build.

Forking off a Fuseki server is not too expensive - the build does it multiple times already.

This is what the jena-integration-tests/ module is for - you do of course need Fuseki built to launch it, and client-server testing ends up in this integration tests module.


Thanks.  It's been a delight extending your work.


Thank you!
Looking forward to the PR,

    Andy


Dave Griffith
Principal Engineer
data.world

On Wed, May 1, 2019 at 4:34 AM Andy Seaborne <[email protected]> wrote:

Dave,

By changing the order of parts of the query, the number of SERVICE calls
can change.  Sometimes it is better to grab more data, once, than many
small calls. And not just for performance if the remote endpoint is
across the unreliable internet.
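
As a toy illustration of how ordering can change the number of remote calls, here is a minimal sketch in Python (not Jena code). It assumes a model in which the SERVICE step is invoked once per incoming binding, so placing it after a local pattern costs one call per local row, while placing it first costs a single call followed by a local join. The names make_service, local_first and service_first are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Tuple

Binding = Dict[str, str]  # variable name -> value


def compatible(a: Binding, b: Binding) -> bool:
    """True if the two bindings agree on every variable they share."""
    return all(b.get(var, val) == val for var, val in a.items())


def make_service(remote_rows: List[Binding]) -> Tuple[Callable[[Binding], List[Binding]], Dict[str, int]]:
    """Build a fake remote endpoint and a counter of how often it is called."""
    calls = {"count": 0}

    def service(incoming: Binding) -> List[Binding]:
        calls["count"] += 1
        # Return only remote rows that can be joined with the incoming binding.
        return [{**incoming, **row} for row in remote_rows if compatible(incoming, row)]

    return service, calls


def local_first(local_rows: List[Binding], service: Callable[[Binding], List[Binding]]) -> List[Binding]:
    """Local pattern first: one remote call per local binding."""
    results: List[Binding] = []
    for binding in local_rows:
        results.extend(service(binding))
    return results


def service_first(local_rows: List[Binding], service: Callable[[Binding], List[Binding]]) -> List[Binding]:
    """Remote pattern first: one call with the empty binding, then a local join."""
    remote = service({})
    return [{**l, **r} for r in remote for l in local_rows if compatible(l, r)]
```

Both orderings produce the same joined rows in this model; what differs is the call count (one per local row versus one in total) and how much data the single call has to bring back.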

As Rob says, batching for SERVICE calls would be good to have.

      Andy

On 01/05/2019 09:40, Rob Vesse wrote:

Dave

Yes, this is what is happening.  This stems from the fact that ARQ is
designed as a lazy streaming evaluation engine, i.e. it tries to do the
least work possible to answer the query. This is why the underlying
implementation is all iterator driven.  In some cases the engine does have
to batch up everything in order to proceed, e.g. DISTINCT/aggregation.
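
The difference between a lazy, iterator-driven stage and one that has to consume everything can be sketched with Python generators. This is only an analogy under simple assumptions, not ARQ's implementation; scan, keep_even and count_all are hypothetical names, and the integers stand in for query solutions.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], pulled: List[int]) -> Iterator[int]:
    """Yield rows one at a time, recording each row that is actually pulled."""
    for row in rows:
        pulled.append(row)
        yield row


def keep_even(rows: Iterable[int]) -> Iterator[int]:
    """A streaming filter: does no work until a consumer asks for a row."""
    return (row for row in rows if row % 2 == 0)


def count_all(rows: Iterable[int]) -> int:
    """An aggregate: must exhaust its input before it can answer."""
    return sum(1 for _ in rows)


# Asking for a single result pulls only as much input as is needed.
pulled: List[int] = []
first = next(keep_even(scan(range(1, 1001), pulled)))
assert first == 2 and pulled == [1, 2]

# An aggregate over the same pipeline has to pull every row.
pulled_all: List[int] = []
total = count_all(keep_even(scan(range(1, 1001), pulled_all)))
assert total == 500 and len(pulled_all) == 1000
```

The same laziness explains the behaviour discussed in this thread: a stage that handles one input at a time naturally issues one downstream request per input unless it is written to buffer some of them first.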


Introducing some degree of batching for SERVICE blocks might be a nice
optimisation. I think this will definitely be valuable to the community;
contributions are always appreciated.


Thanks,

Rob

On 30/04/2019, 18:31, "Dave Griffith" <[email protected]> wrote:

      I'm tracking down an issue with a very slow federated query.  Looking
      through logs, Jena appears to be doing one call to the remote endpoint for
      every set of values that match locally.  This struck me as odd, since the
      SPARQL federation specs suggest that implementations may create "batched"
      queries to remote endpoints using VALUES blocks to pass multiple bindings.
      Looking through the source, it appears that Jena isn't doing that, but
      instead actually is issuing one remote call per binding.

      Am I correct in assuming that this optimization isn't being done, or am I
      missing something?  Looking through the source, it looks like it wouldn't
      be _too_ difficult to change the QueryIterService class to batch up some
      number of results into an OpTable.  OpAsQuery.asQuery would then render
      that as a VALUES block before calling to the remote endpoint.  There are a
      variety of issues to be resolved, most especially around batch size, but
      those don't appear insurmountable.  I haven't found any discussion of this
      possible optimization, but it's entirely possible I just didn't know where
      to look.  I'd be happy to do the work and submit a patch, but if there's a
      reason that people think this optimization shouldn't be done, I'd love to
      hear it before I start.
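
The batching idea described above can be sketched independently of Jena. The following Python is a minimal illustration, not the QueryIterService/OpTable/OpAsQuery code path: it groups incoming bindings into fixed-size batches and renders each batch as a SPARQL VALUES block. The helper names batched and values_block are hypothetical, and the sketch assumes each RDF term has already been serialised to its SPARQL form (for example <http://example/a>).

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Sequence

Binding = Dict[str, str]  # variable name -> already-serialised RDF term


def batched(bindings: Iterable[Binding], size: int) -> Iterator[List[Binding]]:
    """Group a stream of bindings into lists of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    it = iter(bindings)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def values_block(variables: Sequence[str], rows: Sequence[Binding]) -> str:
    """Render rows as a VALUES block; unbound variables become UNDEF."""
    header = " ".join(f"?{var}" for var in variables)
    body = " ".join(
        "(" + " ".join(row.get(var, "UNDEF") for var in variables) + ")"
        for row in rows
    )
    return f"VALUES ({header}) {{ {body} }}"


rows = [
    {"s": "<http://example/a>"},
    {"s": "<http://example/b>"},
    {"s": "<http://example/c>"},
]
for batch in batched(rows, 2):
    # One remote request per batch instead of one per binding.
    print(values_block(["s"], batch))
```

With a batch size of 2, the three bindings above need two remote requests rather than three. Choosing the batch size is the open question raised in the message: larger batches mean fewer round trips but longer queries and more buffering before the first result can be returned.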

      Thanks for reading, and I'd love to hear any thoughts on the matter.

      Dave Griffith
      Principal Engineer
      data.world
