[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2018-11-20 Thread MichaelSchoenitzer
MichaelSchoenitzer added a comment.
I had the same problem and solved it by using the API with the generator exturlusage.TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: MichaelSchoenitzerCc: MichaelSchoenitzer, Nikki, MisterSynergy, Esc3300, Smalyshev, Aklapper, Jheald, ET4Eva, Nandana, Lahi, Gq86, Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Avner, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-03-27 Thread Smalyshev
Smalyshev added a comment.
Yes, it uses B-trees. But what is stored in these trees is not strings. It's not actually hashes, it's term values I think. Not sure about how it is encoded, but I don't see an easy way to do prefix matches there.TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: SmalyshevCc: Esc3300, Smalyshev, Aklapper, Jheald, QZanden, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-03-24 Thread Jheald
Jheald added a comment.
So you're saying, in effect, I should think of the strings being stored as a great big hash table rather than a B-tree, so there's nothing there that can help even STRSTARTS.  And of course I know very little about the internals of BlazeGraph, whereas you've actually written modules for it.  But I did think BlazeGraph uses B+ trees, which do specifically facilitate rapid in-order traversal and retrieval.  But perhaps only to retrieve hashes, not strings?TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: JhealdCc: Smalyshev, Aklapper, Jheald, QZanden, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-02-14 Thread Jheald
Jheald added a comment.
Hi Smalyshev, thanks for taking the time to get back to me.

The LDF suggestion is a good one -- I was thinking of looking in to it for investigating Commons sitelinks and P373s, both of which are now getting close to the borderline of what a query can cope with without timing out, with little time left for filtering.  For these it might be quite useful to have all million downloaded locally.   But it seems like overkill, if all one wants to do is a quick query to check on what is referencing a particular site, with maybe only a couple of hundred hits expected, or in the low thousands; but to be able to do quite general analyses on that (and share them with other people) -- eg what classes of items are showing references to this given website.  I haven't used LDF, but it seems it would be turning what would be quite a simple query into quite a substantial scripting job.  But it's a good suggestion.

Regarding "contains" I do take the point.  Full text indexing is (I believe) technically possible using suffix trees, and I think Blazegraph even offer it as an option.  As I understand it though, it could increase the storage requirement for text fields by a factor of anything up to 20.  I don't know how much of the database is accounted for by text fields, but I could understand a reluctance not to go down that route.

On the other hand, I was a bit disappointed that the query wasn't helped by switching to "strstarts".  As I understand it, Blazegraph does do basic indexing on all fields, which is essential for it to be able to rapidly find and retrieve a fully specified value.   This is what makes it possible for a look-up for a particular url to be very very fast, eg this query

SELECT ?ref  WHERE {
  ?ref pr:P854 
}

What I was hoping was that that indexing might survive (at least until any joins are done) when one identifies the values of pr:P 854 to a variable,

?ref pr:P854 ?url

If the indexing did survive, then (at least in principle) one might hope that it might then be accessible by the implementation of STRSTARTS, so that when presented with

STRSTARTS( ?url, 'http://artuk.org/discover/artists')

Blazegraph would be able to rapidly identify the first match from this index on ?url; the first non-match after those matches; and then rapidly extract everything in between for its solution set.

Of course with string matching on URLs, there is a complication that STRSTARTS only works on strings, so that one also has to do an str() cast, and therefore to make it work the line above thus has to read

STRSTARTS( str(?url),  'http://artuk.org/discover/artists')

But if the URLs were already indexed using eg a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve an indexing, so it might still be possible to execute the line above with indexing, and for it therefore to still be very fast.

So that's what I had in mind, wondering whether it might be possible for a degree of indexing to somehow be preserved through the query execution, so that the STRSTARTS match might sometimes be very fast.

Of course you're right that for URLs, the domain part of the URL eg figaro.fr or artuk.org would almost always be part of what one would want to filter for, so if it could be possible to filter rapidly on that, then the remaining solution set would very likely be small enough to use conventional string operators.  So this could be a great functionality to have, if such a thing could be implemented easily, perhaps as an extra triple on reference objects.

But I can't help thinking that this kind of STRSTARTS request must be something that comes up often enough, in enough different production contexts, that it might be worth Blazegraph's time to see whether some kind of indexing-based acceleration like the above thought-outline could sometimes be possible.TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: JhealdCc: Smalyshev, Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-02-12 Thread Smalyshev
Smalyshev added a comment.
I'm not sure what you mean by "proper indexing". If you ask to go through 2.3 millions of items, it's probably going to take time, and applying operations to it is probably not covered by index. I'm not sure maybe something can be done for prefix matches (just guessing), but for operations like "contains" I don't see any way around fetching each item and applying the operation to it.

One alternative for such queries can be using LDF endpoint: https://query.wikidata.org/bigdata/ldf?subject==http%3A%2F%2Fwww.wikidata.org%2Fprop%2Freference%2FP854=
and processing the matches client-side. LDF endpoint is fast, but requires the client to load a lot of data and process it.TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: SmalyshevCc: Smalyshev, Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-02-12 Thread Jheald
Jheald added a comment.
For comparison, in September 2015 there were 376,524 references using P:854, and a query to search them could complete in 3.7 seconds -- it was used as an example on the query optimisation page.  Such a query now times out.

It should be possible for such a search to be very efficient with proper indexing on the values of P:854.  But is such indexing being sidelined by the cast to string from URL type?  Would it make sense to hold URLs in both URL and string form, to make indexing more possible?TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: JhealdCc: Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Smalyshev, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-02-10 Thread Jheald
Jheald added a comment.
For completeness, the P:854 can also be used as a qualifier, principally on property P1343 "described by source", though the number of uses is much much smaller.

This usage can be searched, eg this query works:

SELECT ?item ?itemLabel ?refURL WHERE {
  ?item ?prop ?stmt .
  ?stmt pq:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'//www.british-history.ac.uk/vch/')) . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 200

However, that is presumably because there are only 8615 such uses in all. (live count).TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: JhealdCc: Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Smalyshev, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T157811: Wikidata reference URIs have become too many to search with WDQS SPARQL

2017-02-10 Thread Jheald
Jheald added a comment.
Some queries:

Count of the number of uses of the property -- takes over 10 seconds

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .
}

Attempt to extract some references to Le Figaro website -- fails -- though it does seem to try to be creating a JSON list of something

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .   
} LIMIT 200

Attempt to use STRSTARTS to possibly save time in the string searching -- fails -- error console appears to contain a similar broken JSON list as before

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.lefigaro.fr')) .   
} LIMIT 200

Attempt just to count the number of such instances -- still times out

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.lefigaro.fr')) .   
}

Attempt to find a meaningless string, to simulate a website used very few times -- still fails to complete

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.pqrstuvwxyz')) .   
} LIMIT 200

The first search for Le Figaro used to work -- back when the WDQS SPARQL service was first enabled.

I don't know if something has changed since then in BlazeGraph or the WDQS set-up, or whether I am seeing this just because the number of uses of P:854 may have increased since that time; but not being able to extract at scale the items that reference particular sources makes the property and its data very much less useful.TASK DETAILhttps://phabricator.wikimedia.org/T157811EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: JhealdCc: Aklapper, Jheald, EBjune, merbst, Avner, debt, Gehel, D3r1ck01, Jonas, FloNight, Xmlizer, Izno, jkroll, Smalyshev, Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs