[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-23 Thread finin
Thanks,  I'll try this!  Knowing which items have at least two sitelinks might 
be good enough.  I was unfamiliar with the VALUES option in SPARQL 1.1.
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-23 Thread Imre Samu
> In the queryable Wikidata model, there is a property wikibase:sitelinks
> whose value is an integer that is the number of Wikipedia sites that the
> item appears on if it is on at least one site.
> This is what I'm after.  I'm not sure that this value is in the RDF dumps
> and in the smaller truthy dumps, in particular.

As far as I can see, "latest-all.nt.bz2" contains the "sitelinks" info
(downloaded from https://dumps.wikimedia.org/wikidatawiki/entities/ ):

$ bzcat latest-all.nt.bz2 | grep sitelink | head
<http://wikiba.se/ontology#sitelinks> "345"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "149"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "235"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "26"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "116"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "29"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "119"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "338"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "292"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://wikiba.se/ontology#sitelinks> "138"^^<http://www.w3.org/2001/XMLSchema#integer> .

>  the number of Wikipedia sites

For example, the first line of my output belongs to Q31 = Belgium (country
in western Europe)   https://www.wikidata.org/wiki/Q31
<http://wikiba.se/ontology#sitelinks> "345"^^<http://www.w3.org/2001/XMLSchema#integer> .

Q31 sitelinks = 345
  == [ Wikipedia (278 entries) + Wikibooks (3 entries) + Wikinews (30
entries) + Wikiquote (12 entries) + Wikivoyage (21 entries) + Multilingual
sites (1 entry) ]

It is not entirely clear to me whether you need the "278" or the "345" as
a result.
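
A minimal Python sketch of the same idea as the bzcat/grep pipeline above:
stream the N-Triples dump and keep only the wikibase:sitelinks triples with
a count above a threshold.  The subject IRIs are not visible in the output
above, so the pattern below assumes that each subject is an entity IRI
ending in a Q-id; that is an assumption to check against the real dump.  The
function names are made up for illustration.

```python
import bz2
import re
from typing import Iterable, Iterator, Tuple

# Matches one N-Triples line whose predicate is wikibase:sitelinks.
# Assumption: the subject is an IRI whose last path segment is a Q-id.
SITELINKS_RE = re.compile(
    r'^<[^>]*/(Q\d+)>\s+'
    r'<http://wikiba\.se/ontology#sitelinks>\s+'
    r'"(\d+)"\^\^<http://www\.w3\.org/2001/XMLSchema#integer>\s*\.\s*$'
)


def sitelink_counts(lines: Iterable[str], minimum: int = 1) -> Iterator[Tuple[str, int]]:
    """Yield (qid, count) for every sitelinks triple with count >= minimum."""
    for line in lines:
        match = SITELINKS_RE.match(line)
        if match and int(match.group(2)) >= minimum:
            yield match.group(1), int(match.group(2))


def read_dump(path: str) -> Iterator[str]:
    """Stream a bz2-compressed dump line by line without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from handle
```

Usage would be along the lines of
sitelink_counts(read_dump("latest-all.nt.bz2")); expect a full pass over
the dump to take a long time.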


Kind regards,
 Imre


[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-23 Thread Tyler M.
If you're willing to settle for all Wikidata items with at least *two*
sitelinks (roughly 11.5 million items), it can be done with five simple
WDQS queries (these only return the QIDs though -- no labels):

SELECT ?i { VALUES ?s { 2 } ?i wikibase:sitelinks ?s }

SELECT ?i { VALUES ?s { 3 } ?i wikibase:sitelinks ?s }

(The sitelink counts are implicit for the above two queries and are omitted
from the results to help avoid a timeout or error message.)

SELECT * { VALUES ?s { 4 7 } ?i wikibase:sitelinks ?s }

SELECT * { VALUES ?s { 5 6 } ?i wikibase:sitelinks ?s }

SELECT * { VALUES ?s { 8 9 10 [...] 398 399 400 } ?i wikibase:sitelinks ?s }

(There are a few dozen Wikimedia page-type items that have more than 400
sitelinks; these can be found here:
https://www.wikidata.org/wiki/Wikidata:Database_reports/Most_sitelinked_items
.)

Each of these queries ran successfully for me in about 20-30 seconds and I
was able to download the full results as both a TSV and JSON file without
any problems.  I had no luck with my attempts to query for the 18.4 million
items with only one sitelink, even when using LIMIT and OFFSET.
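
The long VALUES list in the last query is tedious to type by hand.  A small
Python sketch, assuming nothing beyond the five queries quoted above, that
builds the same query strings; the helper names are hypothetical, and
sending the queries to WDQS is deliberately left out.

```python
from typing import Iterable, List


def values_query(counts: Iterable[int], include_count: bool = True) -> str:
    """Build one query selecting items whose sitelink count is in `counts`."""
    values = " ".join(str(count) for count in counts)
    # "SELECT *" returns both ?s and ?i; "SELECT ?i" drops the count column.
    projection = "*" if include_count else "?i"
    return f"SELECT {projection} {{ VALUES ?s {{ {values} }} ?i wikibase:sitelinks ?s }}"


def batch_queries() -> List[str]:
    """The five batches described above: 2, 3, {4, 7}, {5, 6} and 8..400."""
    return [
        values_query([2], include_count=False),
        values_query([3], include_count=False),
        values_query([4, 7]),
        values_query([5, 6]),
        values_query(range(8, 401)),
    ]
```

The batch boundaries are copied from the queries above; they were chosen so
that each batch finishes before the timeout, so they may need adjusting as
Wikidata grows.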

Hope that helps,

Tyler

On Tue, Mar 22, 2022 at 5:25 PM  wrote:

> Is there a simple way to get the sitelinks count data for all Wikidata
> items?  I want to use the data to help rank possible text entity links to
> Wikidata items
>
> I'm really only interested in counts for items that have at least one
> (e.g., wikibase:sitelinks value that's >0).  According to statistics I've
> seen, only about 1/3 of Wikidata items have at least one sitelink.
>
> I'm not sure if wikibase:sitelinks is included in the standard Wikidata
> dump.  I could try a SPARQL query with an OFFSET and LIMIT, but I doubt
> that the approach would work to completion.


[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-22 Thread finin
In the queryable Wikidata model, there is a property wikibase:sitelinks whose 
value is an integer that is the number of Wikipedia sites that the item appears 
on if it is on at least one site.  This is what I'm after.  I'm not sure that 
this value is in the RDF dumps and in the smaller truthy dumps, in particular.


[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-22 Thread Imre Samu
> sitelinks / I want to use the data to help rank possible text entity
> links to Wikidata items

Side note:
I am helping the https://www.naturalearthdata.com/ project by adding
Wikidata concordances.
It is a public-domain geo-database with [ mountains, rivers, populated
places, ... ]
I am using the Wikidata JSON dumps and importing them into a PostGIS
database.
I rank the matches by
- distance (lower is better)
- text similarity (I check the "labels" and the "aliases")
- and sitelinks!

I also lower the rank of the "mostly imported" sitelinks ("cebwiki", ...).
Why? See:
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/08#Nonsense_imported_from_Geonames

Because a lot of geodata was re-imported, the "distance" and the
"text/labels" are the same.
So be careful with the imported Wikipedia pages (sitelinks)!
As far as I can see, the geodata quality is now much better, mostly where
the active Wikidata community is cleaning up.

This is just an example of why the simple "sitelinks" number is not enough
:-)

On the other hand, the P625 coordinate location is probably also
important:   https://www.wikidata.org/wiki/Property:P625
In Germany, "dewiki" ranks higher;
in Hungary, "huwiki" is preferred.
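
A rough Python sketch of how the three signals above (distance, text
similarity, sitelinks) could be combined, with a lower weight for
mostly-imported sites.  The weight of 0.1 for "cebwiki", the scoring formula
and all function names are assumptions made up for illustration; this is not
the actual Natural Earth matching code.

```python
from difflib import SequenceMatcher
from math import log1p
from typing import Dict, Iterable

# Hypothetical weights: sites whose pages were mostly bot-imported count less.
SITE_WEIGHTS: Dict[str, float] = {"cebwiki": 0.1}


def weighted_sitelinks(sites: Iterable[str]) -> float:
    """Sum the sitelinks, counting down-weighted sites as a fraction."""
    return sum(SITE_WEIGHTS.get(site, 1.0) for site in sites)


def text_similarity(name: str, labels: Iterable[str]) -> float:
    """Best case-insensitive similarity between name and any label or alias."""
    return max(
        (SequenceMatcher(None, name.lower(), label.lower()).ratio() for label in labels),
        default=0.0,
    )


def match_score(name: str, labels: Iterable[str], distance_km: float,
                sites: Iterable[str]) -> float:
    """Higher is better: close by, similar name, many trustworthy sitelinks."""
    closeness = 1.0 / (1.0 + distance_km)
    return text_similarity(name, labels) + closeness + log1p(weighted_sitelinks(sites))
```

With these weights, a candidate linked only from "cebwiki" scores below an
otherwise identical candidate linked from "enwiki" and "dewiki".  A
per-country preference (for example "dewiki" in Germany, "huwiki" in
Hungary) could be expressed by swapping in a different weight table.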

Kind Regards,
 Imre







[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-22 Thread Imre Samu
> I'm not sure if wikibase:sitelinks is included in the standard Wikidata
> dump.

As far as I can see, it is in the JSON dump.
https://www.wikidata.org/wiki/Wikidata:Database_download#JSON_dumps_(recommended)

https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html#json_sitelinks

example:

{
  "sitelinks": {
    "afwiki": {
      "site": "afwiki",
      "title": "New York Stad",
      "badges": []
    },
    "frwiki": {
      "site": "frwiki",
      "title": "New York City",
      "badges": []
    },
    "nlwiki": {
      "site": "nlwiki",
      "title": "New York City",
      "badges": [
        "Q17437796"
      ]
    },
    "enwiki": {
      "site": "enwiki",
      "title": "New York City",
      "badges": []
    },
    "dewiki": {
      "site": "dewiki",
      "title": "New York City",
      "badges": [
        "Q17437798"
      ]
    }
  }
}
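
A short Python sketch of counting sitelinks from an entity shaped like the
example above, reporting both the total and the Wikipedia-only count.  It
assumes the JSON dump layout of one entity per line inside a top-level
array.  The rule "site id ends in wiki, minus a few multilingual projects"
is a heuristic, and the exclusion list is illustrative rather than
complete; the function names are hypothetical.

```python
import json
from typing import Optional, Tuple

# Assumption: ids ending in "wiki" are Wikipedias, except for multilingual
# projects that share the suffix.  Illustrative list, not exhaustive.
NON_WIKIPEDIA_WIKI_IDS = {
    "commonswiki", "specieswiki", "metawiki",
    "mediawikiwiki", "wikidatawiki", "sourceswiki",
}


def parse_dump_line(line: str) -> Optional[dict]:
    """Parse one line of the JSON dump; return None for the array brackets."""
    line = line.strip().rstrip(",")
    if line in ("", "[", "]"):
        return None
    return json.loads(line)


def count_sitelinks(entity: dict) -> Tuple[int, int]:
    """Return (all sitelinks, Wikipedia-only sitelinks) for one entity."""
    sites = entity.get("sitelinks", {})
    wikipedia = sum(
        1 for site in sites
        if site.endswith("wiki") and site not in NON_WIKIPEDIA_WIKI_IDS
    )
    return len(sites), wikipedia
```

For the five-site example above, count_sitelinks would return (5, 5), since
every listed site is a Wikipedia.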



Kind Regards,
Imre




[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-22 Thread Jan Ainali
Here is a dashboard with the number of items that do have a sitelink:
https://grafana.wikimedia.org/goto/9qEhxrP7k?orgId=1

Jan Ainali




[Wikidata] Re: Easiest way to get all sitelists counts > 0?

2022-03-22 Thread Thad Guidry
Sorry, I don't have an answer for you; hopefully others will respond.
When you get the answer, it would be great if you could add a new section
called "Statistics" to this page:  Help:Sitelinks - Wikidata


Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/

