After a bit of work, I have what appears to be a working version of Jena that batches SERVICE calls. It's fairly complex, so I'll be adding more tests and documentation before submitting a pull request to y'all. Are there any contributor docs I should read, particularly around coding standards, configurability, or the level of testing expected? I'd hate to get the etiquette wrong here. On the level of testing in particular: as I say, this is a pretty complex feature and deserves to be fully tested, but I'd hate to slow down your (pretty darn fast) build.
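For anyone skimming the thread, here is a minimal sketch of the batching idea discussed below: group local bindings into chunks and render each chunk as a VALUES block, so one remote call carries several bindings. It is plain Java with no Jena dependency. The names `ValuesBatcher` and `toValuesBlocks`, the string-based term representation, and the batch size of 2 are all assumptions made up for illustration; this is not the actual patch and not a Jena API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of Jena): groups bindings into VALUES blocks. */
class ValuesBatcher {

    /**
     * Renders rows as SPARQL VALUES blocks with at most batchSize rows each.
     * Each row maps a variable name (without '?') to an already-serialized
     * RDF term; a variable missing from a row is written as UNDEF.
     */
    static List<String> toValuesBlocks(List<String> vars,
                                       List<Map<String, String>> rows,
                                       int batchSize) {
        if (vars.isEmpty()) throw new IllegalArgumentException("need at least one variable");
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");

        // Header line, e.g. "VALUES (?s ?p) {"
        StringJoiner header = new StringJoiner(" ?", "VALUES (?", ") {\n");
        vars.forEach(header::add);

        List<String> blocks = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            StringBuilder sb = new StringBuilder(header.toString());
            for (Map<String, String> row : rows.subList(start, end)) {
                // One parenthesized row per binding, terms in variable order
                StringJoiner line = new StringJoiner(" ", "  (", ")\n");
                for (String v : vars) line.add(row.getOrDefault(v, "UNDEF"));
                sb.append(line.toString());
            }
            blocks.add(sb.append("}").toString());
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("s", "<http://example.org/a>"),
            Map.of("s", "<http://example.org/b>"),
            Map.of("s", "<http://example.org/c>"));
        // Each printed block stands for one remote call instead of one per row.
        toValuesBlocks(List.of("s"), rows, 2).forEach(System.out::println);
    }
}
```

With three bindings and a batch size of 2, this produces two blocks, so two remote calls rather than three. Choosing the batch size is the open question mentioned in the thread; the sketch just takes it as a parameter.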
Thanks. It's been a delight extending your work.

Dave Griffith
Principal Engineer
data.world

On Wed, May 1, 2019 at 4:34 AM Andy Seaborne <[email protected]> wrote:

> Dave,
>
> By changing the order of parts of the query, the number of SERVICE calls
> can change. Sometimes it is better to grab more data, once, than many
> small calls. And not just for performance if the remote endpoint is
> across the unreliable internet.
>
> As Rob says, batching for SERVICE calls would be good to have.
>
>      Andy
>
> On 01/05/2019 09:40, Rob Vesse wrote:
> > Dave
> >
> > Yes this is what is happening. This stems from the fact that ARQ is
> > designed as a lazy streaming evaluation engine i.e. It tries to do the
> > least work possible to answer the query. This is why the underlying
> > implementation is all iterator driven. In some cases the engine does
> > have to batch up everything in order to proceed e.g.
> > DISTINCT/aggregation
> >
> > Introducing some degree of batching for SERVICE blocks might be a nice
> > optimisation. I think this will definitely be valuable to the
> > community, contributions are always appreciated
> >
> > Thanks,
> >
> > Rob
> >
> > On 30/04/2019, 18:31, "Dave Griffith" <[email protected]> wrote:
> >
> >     I'm tracking down an issue with a very slow federated query.
> >     Looking through logs, Jena appears to be doing one call to the
> >     remote endpoint for every set of values that match locally. This
> >     struck me as odd, since the SPARQL federation specs suggest that
> >     implementations may create "batched" queries to remote endpoints
> >     using VALUES blocks to pass multiple bindings. Looking through
> >     the source, it appears that Jena isn't doing that, but instead
> >     actually is issuing one remote call per binding.
> >
> >     Am I correct in assuming that this optimization isn't being done,
> >     or am I missing something? Looking through the source, it looks
> >     like it wouldn't be _too_ difficult to change the
> >     QueryIterService class to batch up some number of results into an
> >     OpTable. OpAsQuery.asQuery would then render that as a VALUES
> >     block before calling to the remote endpoint. There are a variety
> >     of issues to be resolved, most especially around batch size, but
> >     those don't appear insurmountable. I haven't found any discussion
> >     of this possible optimization, but it's entirely possible I just
> >     didn't know where to look. I'd be happy to do the work and submit
> >     a batch, but if there's a reason that people think this
> >     optimization shouldn't be done, I'd love to hear it before I
> >     start.
> >
> >     Thanks for reading, and I'd love to hear any thoughts on the
> >     matter.
> >
> >     Dave Griffith
> >     Principal Engineer
> >     data.world
