marcmiquel added a comment.
I need all the Wikidata qitems that relate to Wikipedia articles. If I understand correctly, these are qitems in namespace 0, although not every qitem in namespace 0 necessarily has sitelinks (some are just qitems without an article).
The thing is, I'm not sure whether all Wikidata qitems are in the main namespace (0).
Let me explain what I did.
Since I cannot use the namespace XML tag in the dump to parse only namespace 0 and skip the rest, I used the Wikidata MySQL replica database instead.
There, I ran:
select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;
This is the result:
+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+
So it seems that there are many pages in namespaces 1198, 146, 2600..., besides 3, 4, 2 and 1, which are user talk, project, user and talk pages respectively.
I don't know how many of these are in the dump, but I only need those in namespace 0. So the solution I found is to retrieve all the qitems in namespace 0 from the Wikidata MySQL replica and store them in a database.
Then I query this database while parsing and skip any entity that was not previously inserted. This way the parsing is shorter.
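To make the idea concrete, here is a minimal sketch of that store-then-skip approach in Python. It is only an illustration under some assumptions: the local store is SQLite, the table name main_items and the function names are hypothetical, the titles are assumed to come from something like "select page_title from page where page_namespace = 0;" on the replica (already decoded to strings), and each parsed entity is assumed to be a dict with an "id" key.

```python
import sqlite3


def build_item_store(titles, db_path=":memory:"):
    """Store the namespace-0 page titles (qitem ids such as 'Q42') in SQLite.

    `titles` is assumed to be an iterable of strings fetched beforehand
    from the replica's page table, filtered on page_namespace = 0.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS main_items (qitem TEXT PRIMARY KEY)"
    )
    # INSERT OR IGNORE makes the load safe to repeat and tolerant of duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO main_items (qitem) VALUES (?)",
        ((title,) for title in titles),
    )
    conn.commit()
    return conn


def is_main_item(conn, entity_id):
    """Return True if the id was previously stored as a namespace-0 qitem."""
    row = conn.execute(
        "SELECT 1 FROM main_items WHERE qitem = ?", (entity_id,)
    ).fetchone()
    return row is not None


def filter_entities(conn, entities):
    """Yield only the parsed entities whose id is in the store; skip the rest."""
    for entity in entities:
        if is_main_item(conn, entity["id"]):
            yield entity


if __name__ == "__main__":
    # Toy data standing in for the replica result and the parsed dump.
    store = build_item_store(["Q1", "Q42"])
    parsed = [{"id": "Q42"}, {"id": "P31"}, {"id": "L5"}, {"id": "Q1"}]
    for entity in filter_entities(store, parsed):
        print(entity["id"])
```

The primary key on qitem keeps each lookup an index search, so the check during parsing stays cheap even with tens of millions of stored ids.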
Do you think there is any other way to do it?
Thanks.
