Re: [Virtuoso-users] Best way to query a random sample?

Konrad Höffner Tue, 12 Apr 2011 10:44:55 +0000

Thank you very much, I didn't know this method!

In case I need not a specific number of triples but all triples for aspecific number of resources, I would just use bif:rnd(10,?s) instead ofbif:rnd(10,?s, ?p, ?o), is that correct?

I'm also unsure what value to assign to the "decimation factor" f for agiven sample size of n and instance count m.A naive way would be to use f = m/n (and thus having a probability foreach element to be chosen of p = n/m), however due to the variance ofthe normal distribution there would be a variety of result sizes.

Would the following approach be good?

1. Calculating the standard deviation sigma, which is (m*p*(1-p))^1/2for a normal distribution, simplified to (n*(1-n/m))^1/2, which is closeto n^1/2 for small sample sizes in relation to the whole instance count.

Example: n = 100, m = 10^6 -> sigma ~ 10, p = 10^-4

2. setting p' = p * (1+ c*sigma/m), where c is the safety level, e.g. 3for a 99.7 % chance of getting enough instances.

In the example p' = 10^-4 + 3*10^-5 = 1,3 *10^4

3. Using a "decimation factor" of 1/p', which is 10^4/1.3 in this caseand querying the endpoint.

4. Randomly chosing n elements out of the result set.

Best Regards,
Konrad

On 11.04.2011 22:34, Ivan Mikhailov wrote:

Hello Konrad,

The best, as usual, is decimation in its original style.

E.g.
select ?s ?p ?o from<some-graph>  where { ?s ?p ?o .
   FILTER (1>  bif:rnd (10, ?s, ?p, ?o))
  }

By tweaking first argument of bif:rnd() and the left side of the
inequality you can tweak decimation ratio from 1/10 to the desired
value. What's important is to know that the SQL optimizer has a right to
execute bif:rnd (10) only once at the beginning of the query, so we had
to pass additional three arguments that can be known only when a table
row is fetched so bif:rnd (10, ?s, ?p, ?o) is calculated for every row
and thus any given row is either returned or ignored independently from
others.

However, bif:rnd (10, ?s, ?p, ?o) contains a subtle inefficiency. In RDF
store, graph nodes are stored as numeric IRI IDs and literal objects can
be stored in a separate table. The call of an SQL function needs
arguments of traditional SQL datatypes, so the query processor will
extract the text of IRI for each node and the full value for each
literal object. That's significant waste of time. The workaround is

sparql select ?s ?p ?o from<some-graph>  where { ?s ?p ?o .
   FILTER (1>  <SHORT_OR_LONG::bif:rnd>  (10, ?s, ?p, ?o))  }

that tells the SPARQL front-end to omit redundand conversions of values.


Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com


On Mon, 2011-04-11 at 10:52 +0200, Konrad Höffner wrote:

Hello,

I would like to know the best method to get a random sample of all
triples for a subset of all the resources of a SPARQL endpoint, e.g.:

select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a
dbpedia-owl:Settlement} limit 5}}

The problem here is that the choice of resources returned by the sub
query "select ?s where {?s a dbpedia-owl:Settlement} limit 5" depends on
an arbitrary ordering and may thus return only Cities in the same
Country for example (which is not impossible but improbable in a random
sample), making it unsuitable for some purposes, for instance learning
tasks.

There is a similar question in the jena mailing list
http://tech.groups.yahoo.com/group/jena-dev/message/36776 but the
solution proposed there is the creation of a custom filter function in
ARQ, which to my knowledge works client side and thus is infeasible on
large knowledge bases.

Now the solution that we thought about was to do it in several steps,
namely:

1. Counting the number of relevant resources

select count(?s) as ?count where {?s a dbpedia-owl:Settlement}


2. Generating a random subset S of the set {0,..,count-1}, with a size
of n. This is easily doable in java, however the most efficient
algorithm depends on the relation of count and n because of several ways
to prevent duplicates and is discussed in this thread (German language,
however):
http://www.java-forum.org/mathematik/116289-random-sample.html

3 a. Querying the resources in a loop (Java pseudo code):

for(int s: S)

{
   String queryString = "select ?s ?p ?o where {?s ?p ?o. {select ?s where {?s a 
dbpedia-owl:Settlement} offset "+s+" limit 1}}";
   sample.addAll(query(queryString));
}


3 b. Because the execution of 3a is very slow for large enough n, a
possible optimisation would be the merge of several such queries in a union:

select ?s ?p ?o where
{?s ?p ?o.
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s1+" limit 1} UNION
{select ?s where {?s a dbpedia-owl:Settlement} offset "+s2+" limit 1} UNION
...
{select ?s where {?s a dbpedia-owl:Settlement} offset "+sn+" limit 1}

}


The problem with 3b is that a Virtuoso SPARQL query can only be a
certain size, doing this with n = 100 on the DBpedia SPARQL endpoint
resulted in the error "414 Request-URI Too Large".

Now it would be possible to send the queries in batches of 50 and merge
them afterwards but this still seems to be an awkward and slow solution
to the problem. Also as far as I understand it, leaving out the "ORDER
BY" - statement makes a different ordering for each sub query possible
which may result in some resources being duplicates. On the other hand,
using "ORDER BY ?s" in every sub query should also heavily degrade the
performance.
Thus I would like to know if there are better known solutions to get
random samples from a (Virtuoso) SPARQL endpoint and if there are
optimized solutions if the randomness does not have to stand up to
strict standards.

Two ideas we had were:

- implementing a random ordering function, but on the server side (so
that it can be used by any SPARQL query on the endpoint)
- shuffling the internal order of resources how they are returned when
using no "ORDER BY" clause

Thanks,
Konrad

------------------------------------------------------------------------------
Xperia(TM) PLAY
It's a major breakthrough. An authentic gaming
smartphone on the nation's most reliable network.
And it wants your games.
http://p.sf.net/sfu/verizon-sfdev
_______________________________________________
Virtuoso-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/virtuoso-users

Re: [Virtuoso-users] Best way to query a random sample?

Reply via email to