[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-16 Thread Addshore
Addshore added a comment. As far as I can see we are now covering all of the parts of wikidata-todo/stats that we wanted! TASK DETAIL https://phabricator.wikimedia.org/T117234 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Christopher, Addshore

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-08 Thread Addshore
Addshore added a comment. Okay, I'm struggling to see which part of the todo stats this is covering.

Re: [Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-08 Thread Christopher Johnson
Obviously, a main aspect of the data presented in the todo stats is "referenced statements" (even though the chart labels there are wrong). Whether or not this query maps directly to todo is actually not the key issue. Clearly, measuring data quality requires that the arity of statement to

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-04 Thread Christopher
Christopher added a comment. @Addshore Some progress was made on this in https://phabricator.wikimedia.org/T120166. The only "practical" way to get the statement and reference metrics is to facet the data by property. It is just not possible to run counting queries against the whole database

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-04 Thread Addshore
Addshore added a comment. Your above query is slightly off somewhere and the below is actually correct! PREFIX wikibase: PREFIX wd: PREFIX wdt: PREFIX rdfs:

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-03 Thread Addshore
Addshore added a comment. You do need distinct if you want the correct number there! I was simply pointing out that distinct is what makes the query a long one, not actually the count. I think the issue with potential duplication is being addressed and the datasets are being rebuilt this week

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-02 Thread Christopher
Christopher added a comment. The only way to get a count of statements with references in the current model/format is like this: PREFIX wd: PREFIX wdt: PREFIX prov: SELECT

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-12-01 Thread Addshore
Addshore added a comment. So lots of this is now done using the query service. We need to assess what has been missed / is missing and doesn't already have a ticket on the board.

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-26 Thread Christopher
Christopher added a comment. I am blocked on this by several problems with the data model/ontology. The question of the relationship of the data model and the RDF node definitions is a bit complicated, perhaps more so than it should be. A reference is a special type of statement defined by

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-21 Thread Addshore
Addshore added a comment. In https://phabricator.wikimedia.org/T117234#1820362, @Christopher wrote: > OK. So the title "Referenced Statements by Statement Type" is just wrong > then. Rather, it shows **All Statements ** by Type" > > | Date | itemlink | string |globecoordinate |

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-21 Thread Christopher
Christopher added a comment. Truthy statement counts per Item can be done like this: PREFIX wd: <http://www.wikidata.org/entity/> SELECT (count(distinct(?o)) AS ?ocount) WHERE { wd:Q7239 ?p ?o FILTER(STRSTARTS(STR(?p), "http://www.wikidata.org/prop/direct/")) } Labels per
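The truthy-count query above can be parameterized by item. The following is a minimal sketch, not the code used in the task: the helper names `truthy_count_query` and `request_url` are hypothetical, the endpoint is the public one quoted elsewhere in this thread, and passing the query as a `query` GET parameter is the usual SPARQL protocol convention. It only builds the request and sends nothing.

```python
from urllib.parse import urlencode

# Public query service endpoint as quoted elsewhere in this thread.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
# Namespace of "truthy" (direct) property predicates.
DIRECT_PREFIX = "http://www.wikidata.org/prop/direct/"


def truthy_count_query(item_id: str) -> str:
    """Return a SPARQL query counting distinct truthy values of one item.

    Hypothetical helper: it mirrors the query in the comment above,
    with the item ID (e.g. "Q7239") made a parameter.
    """
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "SELECT (COUNT(DISTINCT ?o) AS ?ocount) WHERE {\n"
        f"  wd:{item_id} ?p ?o .\n"
        f'  FILTER(STRSTARTS(STR(?p), "{DIRECT_PREFIX}"))\n'
        "}"
    )


def request_url(item_id: str) -> str:
    """Build (but do not send) a GET URL carrying the query."""
    return ENDPOINT + "?" + urlencode({"query": truthy_count_query(item_id)})


if __name__ == "__main__":
    print(truthy_count_query("Q7239"))
    print(request_url("Q7239"))
```

Note that the query counts distinct values (`?o`), as in the original, so two properties sharing the same value are counted once.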

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-20 Thread Christopher
Christopher added a comment. OK. So the title "Referenced Statements by Statement Type" is just wrong then. Rather, it shows **All Statements ** by Type" | Date | itemlink | string | globecoordinate | time | quantity | somevalue | novalue | Total | | 2015-10-19 |

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-20 Thread Addshore
Addshore added a comment. > What is still murky to me, and I think possibly wrong with the todo/stats > data, is the "Referenced statements by statement type". Something does not > add up there because the total should not be greater than the sum of > "Statements referenced to Wikipedia by

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-20 Thread Addshore
Addshore added a comment.
> to find out how many statements do not have references is currently not possible.
We may not actually need this, for example if we know the number of items, and the number of referenced statements we must know the number of unreferenced statements.
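The subtraction implied here can be written out explicitly. This is a small sketch under one assumption: the total that matters is the total number of statements (the comment says "items", but a count of items alone would not give the difference). The function name and the example figures are hypothetical placeholders, not measured values.

```python
def unreferenced_statements(total_statements: int, referenced_statements: int) -> int:
    """Return the number of statements that have no reference.

    Assumes every statement is either referenced or unreferenced, so
    the two groups partition the total.
    """
    if referenced_statements > total_statements:
        raise ValueError("referenced count cannot exceed the total")
    return total_statements - referenced_statements


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    print(unreferenced_statements(100, 40))
```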

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-20 Thread Christopher
Christopher added a comment. True, a statement is either referenced or "unreferenced". Getting the number of referenced statements (currently 41,735,203) is easy and fast with: curl -G https://query.wikidata.org/bigdata/namespace/wdq/sparql --data-urlencode ESTCARD --data-urlencode
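For readers unfamiliar with Blazegraph's fast range counts, here is a hedged sketch of how such a request URL is formed. The `ESTCARD` flag and the optional `s`, `p`, `o`, `c` pattern parameters come from the Blazegraph REST API. The choice of `prov:wasDerivedFrom` as the predicate is an assumption, because the original command is cut off in this archive. The helper name `estcard_url` is hypothetical, and the code only builds the URL without sending it.

```python
from urllib.parse import urlencode

# Endpoint as given in the comment above.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

# Assumption: references hang off statements via prov:wasDerivedFrom.
# The predicate actually used in the truncated command is not known.
WAS_DERIVED_FROM = "<http://www.w3.org/ns/prov#wasDerivedFrom>"


def estcard_url(endpoint: str = ENDPOINT, **pattern: str) -> str:
    """Build a fast range count (ESTCARD) URL for a triple pattern.

    Keyword arguments may be s, p, o or c; each value is an IRI in
    angle brackets or a literal, as the Blazegraph REST API expects.
    """
    unknown = set(pattern) - {"s", "p", "o", "c"}
    if unknown:
        raise ValueError(f"unsupported pattern keys: {sorted(unknown)}")
    query = "ESTCARD"
    if pattern:
        # Sort the keys so the resulting URL is deterministic.
        query += "&" + urlencode(sorted(pattern.items()))
    return f"{endpoint}?{query}"


if __name__ == "__main__":
    print(estcard_url(p=WAS_DERIVED_FROM))
```

A range count on a predicate counts matching triples, so under this assumption a statement with two references would contribute two; that differs from a count of distinct referenced statements.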

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-20 Thread Christopher
Christopher added a comment. OK. I may have found an answer to the question of wildcard "Prefix Matching" that is necessary in order to query for number of statements in an item. PREFIX bds: prefix wikibase: SELECT

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-19 Thread Christopher
Christopher added a comment. Yes. It seems I need to disable the 10 minute query timeout set here first: https://github.com/wikimedia/wikidata-query-rdf/blob/b3e646284f0b74131bce99a1b7d5fc6bfe675ec1/war/src/config/web.xml#L55 A fat query like this: PREFIX wikibase:

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-19 Thread Addshore
Addshore added a comment. Any progress here?

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-09 Thread Christopher
Christopher added a comment. No. The blocking task code enables an option to not filter item, statement, value and reference rdf:types in the munger. I decided not to wait for this, so that I could get started, but having it in master is very helpful going forward. In order to have these

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-09 Thread Addshore
Addshore added a comment. As the above blocking task has been resolved, is it possible to perform these on the live query service?

[Wikidata-bugs] [Maniphest] [Commented On] T117234: Reproduce wikidata-todo/stats data using analytics infrastructure

2015-11-07 Thread Christopher
Christopher added a comment. Update: All data loaded into Blazegraph (it took over 24 hours). Sync now running and up to 27 October. Using Fast Range Counts returns counts of content objects instantly. Examples: curl -G http://wdm-rdf.wmflabs.org/bigdata/namespace/wdq/sparql