Hi Paul,

Thanks again - since spec and implementation are in flux (and I'm not
very familiar with either of them), I was not sure about the results to
expect. But if you are surprised too, perhaps I should submit a bug
report (or feature request?). For the moment, I can work around the
problem by running CONSTRUCT queries against the remote dataset and
loading the result into the local one.
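
To illustrate the workaround, here is a rough sketch of how such a
CONSTRUCT query could be put together before it is sent directly to the
remote endpoint. The helper name build_construct_query is made up for this
mail, the dc: prefix is assumed to be Dublin Core elements 1.1, and sending
the query and loading the result into the local dataset are left out.

```python
# Sketch only: build_construct_query is a hypothetical helper, not part of
# Jena or Fuseki. It builds the query text; it does not send it anywhere.

DC = "http://purl.org/dc/elements/1.1/"  # assumed namespace for dc:


def build_construct_query(book_uris):
    """Return a CONSTRUCT query restricted to the given book URIs via VALUES."""
    if not book_uris:
        raise ValueError("need at least one URI")
    for uri in book_uris:
        # Refuse characters that would break the <...> IRI syntax.
        if not uri or any(ch in uri for ch in '<>" \n\t'):
            raise ValueError("not usable as an IRI: %r" % (uri,))
    values = "\n".join("    <%s>" % uri for uri in book_uris)
    return (
        "PREFIX dc: <" + DC + ">\n"
        "CONSTRUCT { ?book dc:title ?title }\n"
        "WHERE {\n"
        "  VALUES ?book {\n"
        + values + "\n"
        "  }\n"
        "  ?book dc:title ?title\n"
        "}\n"
    )


query = build_construct_query([
    "http://example.org/book/book1",
    "http://example.org/book/book2",
])
print(query)
```

Because the VALUES clause sits inside the query that the remote endpoint
evaluates, only the matching triples travel over the wire.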

Your considerations beyond the current use case very much remind me of
query optimization approaches in relational databases. In Oracle, I
remember, for years and years you could (and in practice had to) add
"hints" to overrule the built-in, rule-based optimizer. They got it
almost right in Version 10, and what they did was to use a cost-based
optimizer building on statistics gathered continuously all over their
datasets. This was quite hard to achieve even in the closed world of a
single RDBMS. In the open world of the data web the only possible thing
may be, as you pointed out, to run count queries against the remote
endpoint prior to the query optimization.

I'd argue that there are lots of use cases where 

1) queries are worked out once and run repeatedly 

2) the user already knows (or has some intuitive expectation, which
could easily be verified) the relative sizes on both sides of the
join, and can in any case measure the results of different query approaches.

So I doubt it is worth the effort to make guesses and implement some
magic in the SPARQL processor. Gathering the statistics as an integral
part of every request could be quite expensive and would penalize those
who have already worked it out.
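
If the sizes are supplied by the user instead of being measured on every
request, the rule of thumb from your mail comes down to a single
comparison. A rough sketch, with the caveat that should_send_values and
overhead_factor are names and a default I invented for illustration; nothing
like this exists in ARQ as far as I know, and the right factor would have to
be found by measuring.

```python
# Sketch only: a user-supplied size hint instead of COUNT queries.
# should_send_values and overhead_factor are hypothetical.


def should_send_values(local_bindings, remote_matches, overhead_factor=10):
    """Decide whether to ship local bindings as a VALUES clause.

    local_bindings  -- number of bindings the local pattern yields
    remote_matches  -- number of solutions the remote pattern would return
                       without any restriction
    overhead_factor -- assumed cost of sending one binding relative to
                       receiving one unrestricted result
    """
    if local_bindings < 0 or remote_matches < 0:
        raise ValueError("sizes must not be negative")
    if overhead_factor <= 0:
        raise ValueError("overhead_factor must be positive")
    return local_bindings * overhead_factor <= remote_matches


# The two cases from your example:
print(should_send_values(10, 100000))   # few local books, many remote ones
print(should_send_values(100000, 10))   # many local books, few remote ones
```

With the default factor the first case says "send the VALUES clause" and the
second says "don't", which matches your description.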

Just my two cents - Joachim

> -----Original Message-----
> From: gea...@gmail.com [mailto:gea...@gmail.com] On Behalf Of 
> Paul Gearon
> Sent: Friday, July 06, 2012 8:49 PM
> To: users@jena.apache.org
> Subject: Re: Fuseki VALUES in combination with GROUP BY, 
> SERVICE and CONSTRUCT/INSERT
> 
> On Fri, Jul 6, 2012 at 2:19 PM, Neubert Joachim 
> <j.neub...@zbw.eu> wrote:
> > Hi Paul,
> >
> > Thank you very much for your enlightening explanation. I now 
> > understand why it could not work.
> >
> > But how could I achieve the aim of getting the right bits out of the
> > remote dataset? In the use case which is analogous to the example
> > given here there are just a dozen labels in the VALUES clause (which
> > could be transformed to a UNION statement). They select about 5000
> > uris out of an unrestricted set of about 1 million. In another use case
> > I would ask for about 100,000 uris out of 10,000,000 in the remote
> > dataset. So fetching all the data and applying the restriction locally
> > is not an option.
> 
> That was my reasoning for the VALUES statement. However, I 
> was surprised that the following:
> 
> > The in my naive eyes most logical statement
> >
> >   CONSTRUCT { $book dc:title $title }
> >   WHERE {
> >     SERVICE <http://sparql.org/books/sparql>
> >     { SELECT ?book ?title
> >       WHERE { $book dc:title $title }
> >       VALUES ?book {
> >         <http://example.org/book/book1>
> >         <http://example.org/book/book2>
> >       }
> >     }
> >   }
> >
> > which I had tried first gave me an Error 500: Server Error. In the 
> > server log, I found
> >
> > 19:41:35 WARN  Fuseki               :: [81] RC = 500 : null
> > Not implemented
> >         at com.hp.hpl.jena.sparql.graph.NodeTransformOp.transform(NodeTransformOp.java:154)
> >         at com.hp.hpl.jena.sparql.algebra.op.OpTable.apply(OpTable.java:63)
> >         at com.hp.hpl.jena.sparql.algebra.Transformer$ApplyTransformVisitor.visit0(Transformer.java:270)
> >         ...
> >
> > - so perhaps - hopefully - the syntax above will be implemented 
> > eventually?
> 
> It should be. I'm not sure if it's the "local" engine 
> complaining, or the remote one (which happens to be the local 
> one in this case), but I think it's the local one. If it *is* 
> the local engine, then I'm surprised the values aren't just 
> passed through to the remote service.
> 
> 
> > VALUES/ex-BINDINGS is one of my favorite SPARQL 1.1 statements, 
> > because it allows restrictions with possibly long lists of values 
> > collected out-of-band. I was happy that in the current Fuseki 
> > implementation it is evaluated quite efficiently. My feeling is that
> > it should be possible to pass a VALUES clause explicitly as a part of
> > a SERVICE subclause.
> 
> Agreed. When it was proposed I advocated it for this purpose.
> 
> > Not sure what you intended with appending a VALUES clause silently.
> > Does this refer to prior bindings within the main clause?
> 
> Yes.
> 
> > This would
> > make a lot of sense, but perhaps it could be better achieved with some
> > explicit syntax under the user's responsibility. In any case, an
> > implicit VALUES clause should not interfere with an explicit one given
> > by the user.
> 
> No, the engine should have the ability to figure this out for itself.
> 
> For instance, consider the following WHERE clause:
> 
>      ?localBook dc:title ?title .
>      SERVICE <http://sparql.org/books/sparql>
>      { SELECT ?book ?title WHERE { ?book dc:title ?title } }
> 
> If ?localBook comes to just 10 books, whereas the remote 
> service contains 100,000, then you really want to send those 
> 10 books as a VALUES clause. However, if there are 100,000 
> local books, and only 10 remotely, then you really DON'T want 
> to send those 100,000 books along with the SERVICE request.
> 
> In general, you probably want to send along VALUES to bind 
> variables that are found in the remote service request so 
> long as the size of the bindings is less than the size of the 
> returned data. How much less? Well, that's a heuristic, and 
> is based on the overhead of sending the extra data vs the 
> reduced return size. Count queries are great to help work 
> this out, but they add a lot of overhead for small datasets. 
> Another one of the many problems is that even with sizes of 
> individual BGP resolutions, you can't know the size of the 
> returned data when the bindings are included unless you do 
> the join, so you have to guess the expected sizes after a join.
> 
> My point is that using COUNT/VALUES to make federated queries 
> more efficient is a complex task that must eventually get 
> done, but hasn't been addressed yet. Everything has to be 
> made correct with respect to the still-evolving spec before 
> it can be optimized.
> 
> Paul
> 
