After a bit of work, I have what appears to be a working version of Jena
that batches SERVICE calls. It's fairly complex, so I'll be adding more
tests and documentation before submitting a pull request to y'all.  Are
there any contributor docs I should read, particularly around coding
standards, configurability, or the level of testing expected?  I'd hate to
get the etiquette wrong here.  On testing in particular: as I say, this is
a pretty complex feature and deserves to be fully tested, but I'd hate to
slow down your (pretty darn fast) build.
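
To make the batching idea concrete, here is a rough, language-neutral sketch in Python. It is not the Jena change itself and uses no Jena classes; the names `batches`, `values_block`, and `remote_queries` are hypothetical, and bindings are assumed to be plain dicts whose values are RDF terms already serialized as SPARQL text. It only illustrates grouping local bindings into chunks and rendering each chunk as one VALUES block, so that one remote call is made per chunk rather than per binding.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Assumption: a binding maps a variable name to an already-serialized RDF term.
Binding = Dict[str, str]


def batches(bindings: Iterable[Binding], size: int) -> Iterator[List[Binding]]:
    """Yield lists of at most `size` bindings, pulling lazily from the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(bindings)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def values_block(batch: List[Binding]) -> str:
    """Render one batch as a SPARQL VALUES block; unbound variables become UNDEF."""
    variables = sorted({var for row in batch for var in row})
    header = " ".join("?" + v for v in variables)
    rows = [
        "(" + " ".join(row.get(v, "UNDEF") for v in variables) + ")"
        for row in batch
    ]
    return "VALUES (" + header + ") { " + " ".join(rows) + " }"


def remote_queries(
    bindings: Iterable[Binding], pattern: str, size: int
) -> Iterator[str]:
    """Yield one remote query string per batch instead of one per binding."""
    for batch in batches(bindings, size):
        yield "SELECT * WHERE { " + values_block(batch) + " " + pattern + " }"
```

With three local bindings and a batch size of 2, `remote_queries` yields two query strings instead of three. Because `batches` pulls from its input lazily, only one chunk of bindings is held in memory at a time, which is the main reason batch size is the interesting tuning knob.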

Thanks.  It's been a delight extending your work.

Dave Griffith
Principal Engineer
data.world

On Wed, May 1, 2019 at 4:34 AM Andy Seaborne <[email protected]> wrote:

> Dave,
>
> By changing the order of parts of the query, the number of SERVICE calls
> can change.  Sometimes it is better to grab more data, once, than to make
> many small calls.  And that is not just about performance: it matters for
> reliability too if the remote endpoint is across the unreliable internet.
>
> As Rob says, batching for SERVICE calls would be good to have.
>
>      Andy
>
> On 01/05/2019 09:40, Rob Vesse wrote:
> > Dave
> >
> > Yes, this is what is happening.  This stems from the fact that ARQ is
> > designed as a lazy streaming evaluation engine, i.e. it tries to do the
> > least work possible to answer the query.  This is why the underlying
> > implementation is all iterator driven.  In some cases the engine does
> > have to batch up everything in order to proceed, e.g. for
> > DISTINCT/aggregation.
> >
> > Introducing some degree of batching for SERVICE blocks might be a nice
> > optimisation.  I think this will definitely be valuable to the
> > community, and contributions are always appreciated.
> >
> > Thanks,
> >
> > Rob
> >
> > On 30/04/2019, 18:31, "Dave Griffith" <[email protected]> wrote:
> >
> >      I'm tracking down an issue with a very slow federated query.
> >      Looking through logs, Jena appears to be doing one call to the
> >      remote endpoint for every set of values that match locally.  This
> >      struck me as odd, since the SPARQL federation specs suggest that
> >      implementations may create "batched" queries to remote endpoints
> >      using VALUES blocks to pass multiple bindings.  Looking through the
> >      source, it appears that Jena isn't doing that, but instead actually
> >      is issuing one remote call per binding.
> >
> >      Am I correct in assuming that this optimization isn't being done,
> >      or am I missing something?  Looking through the source, it looks
> >      like it wouldn't be _too_ difficult to change the QueryIterService
> >      class to batch up some number of results into an OpTable.
> >      OpAsQuery.asQuery would then render that as a VALUES block before
> >      calling the remote endpoint.  There are a variety of issues to be
> >      resolved, most especially around batch size, but those don't appear
> >      insurmountable.  I haven't found any discussion of this possible
> >      optimization, but it's entirely possible I just didn't know where
> >      to look.  I'd be happy to do the work and submit a patch, but if
> >      there's a reason that people think this optimization shouldn't be
> >      done, I'd love to hear it before I start.
> >
> >      Thanks for reading, and I'd love to hear any thoughts on the matter.
> >
> >      Dave Griffith
> >      Principal Engineer
> >      data.world
> >
> >
> >
> >
> >
>
