marcmiquel added a comment.
I need all the Wikidata qitems that correspond to Wikipedia articles. If I understand correctly, these are qitems in namespace 0, although not every qitem in namespace 0 necessarily has sitelinks (some are qitems without any article).
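To illustrate the sitelink condition, here is a minimal sketch that assumes each entity is available as parsed JSON in the standard Wikidata entity format, where "sitelinks" maps site ids such as "enwiki" to link records. The function name wikipedia_sitelinks and the NON_WIKIPEDIA exclusion set are hypothetical, and the set is probably incomplete, so it should be checked against the real site list before relying on it.

```python
import json

# Assumption: site ids that end in "wiki" but are not Wikipedia editions.
# This list is illustrative and likely incomplete.
NON_WIKIPEDIA = {
    "commonswiki", "specieswiki", "metawiki", "wikidatawiki",
    "mediawikiwiki", "sourceswiki", "outreachwiki", "incubatorwiki",
}


def wikipedia_sitelinks(entity):
    """Return the sorted Wikipedia site ids an entity links to."""
    sitelinks = entity.get("sitelinks", {})
    return sorted(
        site for site in sitelinks
        if site.endswith("wiki") and site not in NON_WIKIPEDIA
    )


# Hand-written sample entity, not taken from a real dump.
sample = json.loads(
    '{"id": "Q42", "sitelinks": {'
    '"enwiki": {"site": "enwiki", "title": "Douglas Adams"},'
    '"commonswiki": {"site": "commonswiki", "title": "Douglas Adams"},'
    '"enwikiquote": {"site": "enwikiquote", "title": "Douglas Adams"}}}'
)
has_article = bool(wikipedia_sitelinks(sample))
```

An entity for which the function returns an empty list would be one of the qitems without an article.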
The thing is that I'm not sure all Wikidata qitems are in the main namespace (0).
Let me explain what I did.
Since I cannot use the namespace XML tag in the dump to parse only namespace 0 and skip the rest, I used the Wikidata MySQL replica database instead.
There, I ran this query:
select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;
This is the result:
+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+
So it seems that there are many pages in namespaces 1198, 146, 2600, and others, besides 3, 4, 2, and 1, which are user talk, project, user page, and talk page respectively.
I don't know how many of these are in the dump, but I only need the pages in namespace 0. So the solution I found is to retrieve all the qitems in namespace 0 from the Wikidata MySQL replica database and store them in a database.
When parsing, I consult this database and skip any entity that was not previously inserted. This way the parsing is shorter.
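The store-then-skip approach described above can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library as the lookup database, the table name main_ns_items and all function names are hypothetical, and the input ids stand in for the page_title values that a query on the replica's page table with page_namespace = 0 would return.

```python
import sqlite3


def build_qid_store(qids, path=":memory:"):
    """Store the namespace-0 ids in a SQLite table keyed by id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main_ns_items (qid TEXT PRIMARY KEY)"
    )
    # INSERT OR IGNORE makes repeated ids harmless.
    conn.executemany(
        "INSERT OR IGNORE INTO main_ns_items (qid) VALUES (?)",
        ((qid,) for qid in qids),
    )
    conn.commit()
    return conn


def is_main_ns_item(conn, entity_id):
    """True if the id was previously inserted into the store."""
    row = conn.execute(
        "SELECT 1 FROM main_ns_items WHERE qid = ?", (entity_id,)
    ).fetchone()
    return row is not None


def filter_entities(conn, entity_ids):
    """Yield only the ids found in the store, skipping the rest."""
    for entity_id in entity_ids:
        if is_main_ns_item(conn, entity_id):
            yield entity_id


# Made-up ids: the first list plays the role of the replica result,
# the second the ids encountered while parsing the dump.
store = build_qid_store(["Q1", "Q42", "Q42"])
kept = list(filter_entities(store, ["Q1", "P31", "L5", "Q42"]))
```

With tens of millions of ids, one primary-key lookup per entity may still be slow. Loading the ids into an in-memory set is a possible alternative if memory allows, though that is a guess rather than something measured here.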
Do you think there is any other way to do it?
Thanks.
Cc: Addshore, Chicocvenancio, marcmiquel, Nandana, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, _jensen, gnosygnu, Wikidata-bugs, aude, Svick, Mbch331, jeremyb
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs