Re: Batching federated calls using VALUES block

Rob Vesse Mon, 10 Jun 2019 01:43:44 -0700

Dave

Thanks for continuing to look at this.  We don't have strict code standards: 
with such an old code base there is a wide variety of code styles, so the 
general rule is to follow the surrounding code style. 
https://jena.apache.org/getting_involved/reviewing_contributions.html details 
our reviewing guidelines, but these are fairly flexibly enforced.


Yes, good test coverage for something like this would be a must.  If the tests 
are particularly slow they can always be put in a separate module and excluded 
from the faster "dev" profile.  You may want to create a separate tests module 
anyway because then you'd be able to depend on Fuseki embedded and bring up 
Fuseki servers as part of your tests.  Since the SERVICE logic lives in ARQ, 
trying to depend on Fuseki from there would create a circular dependency.  See 
https://jena.apache.org/documentation/fuseki2/fuseki-run.html#fuseki-main, 
particularly the bit on Fuseki as a Configurable and Embeddable SPARQL Server.

Rob

On 07/06/2019, 19:19, "Dave Griffith" <[email protected]> wrote:

    After a bit of work, I have what appears to be a working version of Jena
    with batching SERVICE calls. It's sort of complex, so I'll be adding more
    tests and documentation before submitting a pull request to y'all.  Are
    there any contributor docs I should read, particularly around coding
    standards, configurability, or level of testing expected?  I'd hate to get
    the etiquette wrong here.  Around level of testing in particular, this is
    as I say a pretty complex feature and deserves to be fully tested, but I'd
    hate to slow down your (pretty darn fast) build.
    
    Thanks.  It's been a delight extending your work.
    
    Dave Griffith
    Principal Engineer
    data.world
    
    On Wed, May 1, 2019 at 4:34 AM Andy Seaborne <[email protected]> wrote:
    
    > Dave,
    >
    > By changing the order of parts of the query, the number of SERVICE calls
    > can change.  Sometimes it is better to grab more data, once, than many
    > small calls. And not just for performance if the remote endpoint is
    > across the unreliable internet.
    >
    > As Rob says, batching for SERVICE calls would be good to have.
    >
    >      Andy
    >
    > On 01/05/2019 09:40, Rob Vesse wrote:
    > > Dave
    > >
    > > Yes this is what is happening.  This stems from the fact that ARQ is
    > > designed as a lazy streaming evaluation engine, i.e. it tries to do the
    > > least work possible to answer the query. This is why the underlying
    > > implementation is all iterator driven.  In some cases the engine does have
    > > to batch up everything in order to proceed, e.g. DISTINCT/aggregation.
    > >
    > > Introducing some degree of batching for SERVICE blocks might be a nice
    > > optimisation. I think this will definitely be valuable to the community;
    > > contributions are always appreciated.
    > >
    > > Thanks,
    > >
    > > Rob
    > >
    > > On 30/04/2019, 18:31, "Dave Griffith" <[email protected]> wrote:
    > >
    > >      I'm tracking down an issue with a very slow federated query.  Looking
    > >      through logs, Jena appears to be doing one call to the remote endpoint
    > >      for every set of values that match locally.  This struck me as odd,
    > >      since the SPARQL federation specs suggest that implementations may
    > >      create "batched" queries to remote endpoints using VALUES blocks to
    > >      pass multiple bindings.  Looking through the source, it appears that
    > >      Jena isn't doing that, but instead actually is issuing one remote call
    > >      per binding.
    > >
    > >      Am I correct in assuming that this optimization isn't being done, or
    > >      am I missing something?  Looking through the source, it looks like it
    > >      wouldn't be _too_ difficult to change the QueryIterService class to
    > >      batch up some number of results into an OpTable.  OpAsQuery.asQuery
    > >      would then render that as a VALUES block before calling to the remote
    > >      endpoint.  There are a variety of issues to be resolved, most
    > >      especially around batch size, but those don't appear insurmountable.
    > >      I haven't found any discussion of this possible optimization, but
    > >      it's entirely possible I just didn't know where to look.  I'd be happy
    > >      to do the work and submit a patch, but if there's a reason that people
    > >      think this optimization shouldn't be done, I'd love to hear it before
    > >      I start.
    > >
    > >      Thanks for reading, and I'd love to hear any thoughts on the matter.
    > >
    > >      Dave Griffith
    > >      Principal Engineer
    > >      data.world
    > >
    > >
    > >
    > >
    > >
    >
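
The batching idea discussed in the thread above (pull several bindings from a
lazy iterator, then send them to the remote endpoint as one VALUES block
instead of one call per binding) can be sketched roughly as follows.  This is
plain Java only: ValuesBatcher, batches and renderValues are hypothetical names
made up for illustration, not Jena APIs, and the sketch says nothing about how
QueryIterService or OpAsQuery actually work.  It assumes each term is already
serialized in SPARQL syntax.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

/** Hypothetical helper (not part of Jena): batches bindings and renders VALUES blocks. */
final class ValuesBatcher {

    private ValuesBatcher() {}

    /** Lazily groups the input into lists of at most batchSize elements. */
    static <T> Iterator<List<T>> batches(Iterator<T> input, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        return new Iterator<List<T>>() {
            @Override
            public boolean hasNext() {
                return input.hasNext();
            }

            @Override
            public List<T> next() {
                if (!input.hasNext()) {
                    throw new NoSuchElementException();
                }
                // Only pull as many upstream elements as one batch needs.
                List<T> batch = new ArrayList<>(batchSize);
                while (batch.size() < batchSize && input.hasNext()) {
                    batch.add(input.next());
                }
                return batch;
            }
        };
    }

    /**
     * Renders one batch as a SPARQL VALUES block.  Each binding maps a variable
     * name (without '?') to a term already written in SPARQL syntax, which is an
     * assumption of this sketch.  Variables missing from a binding become UNDEF.
     */
    static String renderValues(List<Map<String, String>> batch) {
        // Collect every variable used in the batch, in first-seen order.
        Set<String> vars = new LinkedHashSet<>();
        for (Map<String, String> binding : batch) {
            vars.addAll(binding.keySet());
        }
        StringBuilder sb = new StringBuilder("VALUES (");
        for (String v : vars) {
            sb.append(" ?").append(v);
        }
        sb.append(" ) {\n");
        for (Map<String, String> binding : batch) {
            sb.append("  (");
            for (String v : vars) {
                sb.append(' ').append(binding.getOrDefault(v, "UNDEF"));
            }
            sb.append(" )\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        List<Map<String, String>> bindings = List.of(
                Map.of("s", "<http://example/a>"),
                Map.of("s", "<http://example/b>"),
                Map.of("s", "<http://example/c>"));
        // Three bindings with a batch size of 2 give two VALUES blocks.
        Iterator<List<Map<String, String>>> it = batches(bindings.iterator(), 2);
        while (it.hasNext()) {
            System.out.println(renderValues(it.next()));
        }
    }
}
```

In a real implementation each rendered block would be placed in the sub-query
sent to the SERVICE endpoint and the results joined back locally.  The batch
size trades fewer round trips against larger requests, which is the open
question raised in the original message.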
