Jeroen De Dauw wrote:

> > although I suppose the most general solution of all would be to
> > implement aggregation queries.
> > ..
>
> > I guess GROUP BY and COUNT() functionality are the bits that would
> > jeopardize sanity? :)
>
> I actually discussed this at length with Yaron, and we concluded that
> generic group by functionality would not be terribly useful, since it's
> hard to imagine cases where you would not just want to count the
> occurrences. My current implementation is pretty much equivalent to doing a
> group by count I think (not sure, as I'm not that familiar with the SQL
> group by statement).
>
>
GROUP BY is basically a way to tell the SQL parser that you want to feed
every hit where field X has the same value into an aggregate function such
as COUNT or SUM; for all aggregate functions *except* COUNT, this assumes
that the function will be taking its parameters from another field or
fields.
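
As a minimal sketch of that distinction, here is a toy example using
Python's built-in sqlite3 module.  The table name "hits", its columns x and
y, and the rows are all made up for illustration; nothing here reflects how
SMW stores its data.

```python
import sqlite3

# Toy table: x is the field we group on, y is a numeric field to aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (x TEXT, y INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 10), ("b", 20), ("b", 30)],
)

# COUNT(*) needs no other field; SUM has to be told which field to add up.
rows = conn.execute(
    "SELECT x, COUNT(*), SUM(y) FROM hits GROUP BY x ORDER BY x"
).fetchall()
print(rows)  # [('a', 2, 3), ('b', 3, 60)]
```

Every row with the same x lands in one group, and each aggregate function
is evaluated once per group.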

This actually *does* apply to inline queries.  Take, for example, the
following, taken from the SMW Wiki:

{{#ask: [[Category:City]] [[located in::Germany]]
| ?population
| ?area#km² = Size in km²
}}

This produces:

            Population    Size in km²
Berlin      3,391,409     891.85 km²
Frankfurt     679,664     248.31 km²
Munich      1,259,678     310.43 km²
Stuttgart     595,452     208.754 km²

(Which I hope is legible in your email client.)

If this were the result set of a database query, one could tweak it by
grouping by "located in" and returning, say, the total number of people
living in Germany's cities, or the average area of a German city in square
kilometers.  Replace "located in::Germany" with "Continent::Europe" while
keeping things grouped by "located in", and you could compare the urban
populations of Germany, France, Switzerland, etc.

The real question isn't whether or not such a query would be useful; the
question is whether or not it would be useful *enough* to justify the
complications and overhead that would come with implementing it.  Do we
really want people performing statistical analysis by means of inline
queries, or would the business of grouping pages by property and
aggregating results within those groups be better handled by a third-party ontology
engine?

If we *do* decide that a more comprehensive "aggregate results" inline
query is warranted, I'd suggest *not* trying to shoehorn it into #ask.  For
example:

{{#summarize: [[Category:City]] [[located in::Germany]]
| shared=located in
| ?sum(population) = urban population
| ?avg(area)#km² = Average Size
}}

#summarize would be similar to #ask, except that there would be a mandatory
shared parameter, all of the printout statements would be assumed to be
aggregate functions, and the result formats would use the values of the
shared property instead of the names of the matching pages:

            Urban Population    Average Size
Germany     5,926,203           414.836 km²
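
To make the intended semantics concrete, here is a rough Python sketch of
the grouping step.  The function name summarize, the printout tuples, and
the AGGREGATES table are all hypothetical; the city figures are the ones
from the #ask result above.  This only illustrates the idea and says
nothing about how SMW would actually implement it.

```python
from collections import defaultdict
from statistics import mean

# Values copied from the #ask result table earlier in this message.
CITIES = [
    {"page": "Berlin", "located in": "Germany",
     "population": 3391409, "area": 891.85},
    {"page": "Frankfurt", "located in": "Germany",
     "population": 679664, "area": 248.31},
    {"page": "Munich", "located in": "Germany",
     "population": 1259678, "area": 310.43},
    {"page": "Stuttgart", "located in": "Germany",
     "population": 595452, "area": 208.754},
]

# Hypothetical aggregate names; the proposal leaves the final set open.
AGGREGATES = {"sum": sum, "avg": mean, "count": len}


def summarize(pages, shared, printouts):
    """Group pages by the `shared` property and aggregate each printout.

    printouts is a list of (aggregate name, property name, label) tuples.
    Returns {shared value: {label: aggregated value}}.
    """
    groups = defaultdict(list)
    for page in pages:
        groups[page[shared]].append(page)

    result = {}
    for key, members in groups.items():
        result[key] = {
            label: AGGREGATES[func]([m[prop] for m in members])
            for func, prop, label in printouts
        }
    return result


table = summarize(
    CITIES,
    shared="located in",
    printouts=[
        ("sum", "population", "urban population"),
        ("avg", "area", "Average Size"),
    ],
)
print(table["Germany"]["urban population"])  # 5926203
```

The rows of the result are keyed by the values of the shared property
rather than by page name, which is the one real difference from #ask.  The
average area comes out at about 414.836, matching the table above.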

Again, the main issue here is the overhead that you're likely to encounter
implementing this sort of thing.  How do you keep the processing overhead
to a minimum, and how low can that minimum be?  Which aggregate functions
does #summarize recognize?  (For instance, I could see arguments for
recognizing aggregate functions such as "count if" and "sum of product", to
borrow two fairly useful examples from the spreadsheet world; but that
would entail more work on the designers' part, if only in the form of
providing a light-weight but secure hook for others to use in creating
their own.)  And so on.
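
For what it's worth, the kind of hook I have in mind could be as small as a
name-to-function registry.  The sketch below is purely illustrative:
register_aggregate, call_aggregate, the aggregate names "countif" and
"sumproduct", and the sample rows are all invented here, and the "count if"
condition is simplified to a greater-than threshold.

```python
# Hypothetical registry mapping aggregate names to plain functions.
_REGISTRY = {}


def register_aggregate(name):
    """Decorator that registers a function under an aggregate name."""
    def decorator(func):
        if name in _REGISTRY:
            raise ValueError(f"aggregate already registered: {name}")
        _REGISTRY[name] = func
        return func
    return decorator


@register_aggregate("countif")
def count_if(rows, prop, threshold):
    """Count rows whose `prop` value is greater than `threshold`."""
    return sum(1 for row in rows if row[prop] > threshold)


@register_aggregate("sumproduct")
def sum_product(rows, prop_a, prop_b):
    """Sum the per-row products of two properties."""
    return sum(row[prop_a] * row[prop_b] for row in rows)


def call_aggregate(name, rows, *args):
    """Look the name up in the registry; unknown names are rejected."""
    try:
        func = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown aggregate: {name}") from None
    return func(rows, *args)


# Made-up rows, only to exercise the two functions.
rows = [{"qty": 2, "price": 5}, {"qty": 3, "price": 10}]
print(call_aggregate("countif", rows, "qty", 2))           # 1
print(call_aggregate("sumproduct", rows, "qty", "price"))  # 40
```

Because a query can only name functions that were registered ahead of
time, nothing typed into the wiki text is ever evaluated as code, which is
roughly what I mean by "secure".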

-- 
Jonathan "Dataweaver" Lang
_______________________________________________
Semediawiki-devel mailing list
Semediawiki-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/semediawiki-devel
