Re: [Wikidata] Source statistics

2015-09-07 Thread Edgard Marx
It is not an up-to-date version, but there is

dbtrends.aksw.org

best,
Edgard

On Mon, Sep 7, 2015 at 1:25 PM, André Costa wrote:

> Hi all!
>
> I'm wondering if there is a way (SQL, API, tool or otherwise) of finding
> out how often a particular source is used on Wikidata.
>
> The background is a collaboration with two GLAMs where we have used their
> open (and CC0) datasets to add and/or source statements on Wikidata for
> items on which they can be considered an authority. Now I figured it would
> be nice to give them back a number for just how big the impact was.
>
> While I can find out how many items should be affected, I couldn't find an
> easy way, short of analysing each item, to determine how many statements
> were affected.
>
> Any suggestions would be welcome.
>
> Some details: Each reference is a P248 claim + P577 claim (where the
> latter may change)
>
> Cheers,
> André / Lokal_Profil
> André Costa | GLAM-tekniker, Wikimedia Sverige | andre.co...@wikimedia.se
> | +46 (0)733-964574
>
> Support free knowledge, become a member of Wikimedia Sverige.
> Read more at blimedlem.wikimedia.se
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Source statistics

2015-09-07 Thread Markus Krötzsch
P.S. If you want to do this yourself to play with it, below is the 
relevant information on how I wrote this code (looks a bit clumsy in 
email, but I don't have time now to set up a tutorial page ;-).


Markus


(1) I modified the example program "EntityStatisticsProcessor" that is 
part of Wikidata Toolkit [1].

(2) I added a new field to count references:

final HashMap<Reference, Integer> refStatistics = new HashMap<>();

(3) The example program already downloads and processes all items and 
properties in the most recent dump. You just have to add the counting. 
Essentially, this is the code I run on every ItemDocument and 
PropertyDocument:


public void countReferences(StatementDocument statementDocument) {
  for (StatementGroup sg : statementDocument.getStatementGroups()) {
    for (Statement s : sg.getStatements()) {
      for (Reference r : s.getReferences()) {
        if (!refStatistics.containsKey(r)) {
          refStatistics.put(r, 1);
        } else {
          refStatistics.put(r, refStatistics.get(r) + 1);
        }
      }
    }
  }
}

(the example already has a method "countStatements" that does these 
iterations, so you can also insert the code there).
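
The containsKey/put pattern above can also be written with Map.merge (Java 8 or later). The following is only a minimal, self-contained sketch of that counting idiom: plain strings stand in for Reference objects, and the class and method names are made up for illustration and are not part of Wikidata Toolkit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReferenceCountSketch {

    // Counts how often each key occurs. Strings stand in for Reference
    // objects here (an assumption made to keep the sketch self-contained).
    static Map<String, Integer> countOccurrences(List<String> references) {
        Map<String, Integer> counts = new HashMap<>();
        for (String ref : references) {
            // merge() stores 1 for a new key, otherwise adds 1 to the old count.
            counts.merge(ref, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> refs = Arrays.asList("source A", "source B", "source A");
        System.out.println(countOccurrences(refs));
    }
}
```

As in the original code, this only works as intended if two references with the same content compare as equal and have the same hash code.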



(4) To print the output to a file, I sort the hash map by values first. 
Here's some standard code for how to do this:


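// Needs imports from java.io (PrintStream, IOException) and java.util
// (List, LinkedList, Collections, Comparator, Map.Entry).
try (PrintStream out = new PrintStream(
    ExampleHelpers.openExampleFileOuputStream("reference-counts.txt"))) {
  List<Entry<Reference, Integer>> list =
      new LinkedList<Entry<Reference, Integer>>(refStatistics.entrySet());

  // Sort by count, most frequently used reference first.
  Collections.sort(list, new Comparator<Entry<Reference, Integer>>() {
    @Override
    public int compare(Entry<Reference, Integer> o1,
        Entry<Reference, Integer> o2) {
      return o2.getValue().compareTo(o1.getValue());
    }
  });

  // Print references used more than once; only tally the single-use ones.
  int singleRefs = 0;
  for (Entry<Reference, Integer> entry : list) {
    if (entry.getValue() > 1) {
      out.println(entry.getValue() + " x " + entry.getKey());
    } else {
      singleRefs++;
    }
  }
  out.println("... and another " + singleRefs
      + " references that occurred just once.");
} catch (IOException e) {
  e.printStackTrace();
}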

I put this code into the existing method writeFinalResults() that is 
called at the end.
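
For experimenting with the sort-and-print step outside of Wikidata Toolkit, here is a minimal stand-alone sketch. It assumes string keys instead of Reference objects and collects the output lines in a list rather than writing a file; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortedCountsSketch {

    // Builds the report lines: keys used more than once, most frequent
    // first, followed by one summary line for the keys used exactly once.
    static List<String> report(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        // Sort by count in descending order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        List<String> lines = new ArrayList<>();
        int singleRefs = 0;
        for (Map.Entry<String, Integer> entry : entries) {
            if (entry.getValue() > 1) {
                lines.add(entry.getValue() + " x " + entry.getKey());
            } else {
                singleRefs++;
            }
        }
        lines.add("... and another " + singleRefs
            + " references that occurred just once.");
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("source A", 3);
        counts.put("source B", 1);
        counts.put("source C", 2);
        for (String line : report(counts)) {
            System.out.println(line);
        }
    }
}
```

Using an ArrayList with List.sort instead of a LinkedList with Collections.sort is a stylistic choice; both give the same ordering.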


As I said, this runs in about 30min on my laptop, but downloading the 
dump file for the first time takes a bit longer.



[1] https://github.com/Wikidata/Wikidata-Toolkit/blob/v0.5.0/wdtk-examples/src/main/java/org/wikidata/wdtk/examples/EntityStatisticsProcessor.java


On 07.09.2015 15:49, Markus Krötzsch wrote:

Hi André,

I just made a small counting program with Wikidata Toolkit to count
unique references. Running it on the most recent dump took about 30min.
I uploaded the results:

http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-counts-50.txt


The file lists all references that are used at least 50 times, ordered
by number of uses. There were 593778 unique references for 35485364
referenced statements (out of 69942556 statements in total).

416480 of the references are used only once. If you want to see all
references used at least twice, this is a slightly longer file:

http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-counts.txt.gz
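
To put these figures in proportion, here is a small arithmetic sketch that derives two shares from the numbers quoted above. Only the four input numbers come from this message; the class and method names are made up for illustration.

```java
import java.util.Locale;

class ReferenceShareSketch {

    // Returns part/whole as a percentage.
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        long totalStatements = 69942556L;
        long referencedStatements = 35485364L;
        long uniqueReferences = 593778L;
        long usedOnce = 416480L;

        // Share of all statements that are referenced.
        System.out.println(String.format(Locale.ROOT,
            "%.1f%% of statements are referenced",
            percent(referencedStatements, totalStatements)));
        // prints "50.7% of statements are referenced"

        // Share of unique references that are used only once.
        System.out.println(String.format(Locale.ROOT,
            "%.1f%% of unique references are used only once",
            percent(usedOnce, uniqueReferences)));
        // prints "70.1% of unique references are used only once"
    }
}
```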


Best regards,

Markus



Re: [Wikidata] Source statistics

2015-09-07 Thread Markus Krötzsch

On 07.09.2015 14:25, Edgard Marx wrote:

It is not an up-to-date version, but there is

dbtrends.aksw.org 


I am getting an error there. Is the server down maybe?

Markus





[Wikidata] Source statistics

2015-09-07 Thread André Costa
Hi all!

I'm wondering if there is a way (SQL, API, tool or otherwise) of finding
out how often a particular source is used on Wikidata.

The background is a collaboration with two GLAMs where we have used their
open (and CC0) datasets to add and/or source statements on Wikidata for
items on which they can be considered an authority. Now I figured it would
be nice to give them back a number for just how big the impact was.

While I can find out how many items should be affected, I couldn't find an
easy way, short of analysing each item, to determine how many statements
were affected.

Any suggestions would be welcome.

Some details: Each reference is a P248 claim + P577 claim (where the latter
may change)
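
Because the P577 value may differ between references to the same source, a per-reference count would list one source under several entries. The following is a minimal sketch of summing such counts per P248 value. It assumes the per-reference counts are already available in a flattened form; the RefCount class, the method name and the item IDs are hypothetical placeholders, not real Wikidata data or a Wikidata Toolkit API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SourceTotalsSketch {

    // Hypothetical flattened reference: the P248 ("stated in") value, the
    // P577 ("publication date") value, and how often this pair was used.
    static final class RefCount {
        final String statedIn;
        final String published;
        final int uses;

        RefCount(String statedIn, String published, int uses) {
            this.statedIn = statedIn;
            this.published = published;
            this.uses = uses;
        }
    }

    // Sums the use counts per P248 value, ignoring the P577 value.
    static Map<String, Integer> totalsBySource(List<RefCount> refs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (RefCount r : refs) {
            totals.merge(r.statedIn, r.uses, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Placeholder IDs and dates, for illustration only.
        List<RefCount> refs = Arrays.asList(
            new RefCount("Q-example-1", "2015-01-01", 4),
            new RefCount("Q-example-1", "2015-06-01", 3),
            new RefCount("Q-example-2", "2014-12-31", 5));
        System.out.println(totalsBySource(refs));
        // prints "{Q-example-1=7, Q-example-2=5}"
    }
}
```

Note that this sums reference uses: a statement that carries two references to the same source with different dates would be counted twice.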

Cheers,
André / Lokal_Profil
André Costa | GLAM-tekniker, Wikimedia Sverige | andre.co...@wikimedia.se |
+46 (0)733-964574

Support free knowledge, become a member of Wikimedia Sverige.
Read more at blimedlem.wikimedia.se


Re: [Wikidata] Source statistics

2015-09-07 Thread Stas Malyshev
Hi!

> A small fix though: I think you should better use count(?statement)
> rather than count(?ref), right?

Yes, of course, my mistake - I modified it from a different query and
forgot to change it.

> I have tried a similar query on the public test endpoint on labs
> earlier, but it timed out for me (I was using a very common reference
> though ;-). For rarer references, live queries are definitely the better
> approach.

Works for me for Q216047, didn't check others though. For popular
references, the Labs one may indeed be too slow. A faster one is coming
"real soon now" :)

-- 
Stas Malyshev
smalys...@wikimedia.org
