marcmiquel added a comment.

I need all the Wikidata qitems that relate to Wikipedia articles. If I understand correctly, these are qitems in namespace 0, although not every qitem in namespace 0 necessarily has sitelinks (some are just qitems without an article).

The thing is that I'm not sure whether all Wikidata qitems are in the main namespace (0).

Let me explain what I did.

Since I cannot use the namespace XML tag in the dump to parse only namespace 0 and skip the rest, I used the Wikidata MySQL replica database instead.

There, I ran this query:
select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;

This is the result:

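+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+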

So it seems that there are many pages in namespaces 1198, 146, 2600... besides 3, 4, 2 and 1, which are user talk, project, user page and talk page.

I don't know how many of these are in the dump, but I only need those in namespace 0. So the solution I found is to retrieve all the qitems in namespace 0 from the Wikidata MySQL replica database and store them in a database.
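
A minimal sketch of that retrieval step, in Python. It is only an illustration under assumptions: the replica is replaced here by an in-memory SQLite stand-in with made-up rows, the `qitems` table and the `copy_main_namespace_items` function are hypothetical names, and it assumes that `page_title` holds the qitem id for pages in namespace 0.

```python
import sqlite3


def copy_main_namespace_items(source, target):
    """Copy the titles of namespace 0 pages from `source` into `target`.

    `qitems` is a hypothetical table name chosen for this sketch.
    Returns the number of qitems stored in `target`.
    """
    target.execute("CREATE TABLE IF NOT EXISTS qitems (qitem TEXT PRIMARY KEY)")
    rows = source.execute("SELECT page_title FROM page WHERE page_namespace = 0")
    # INSERT OR IGNORE makes the copy safe to re-run.
    target.executemany("INSERT OR IGNORE INTO qitems (qitem) VALUES (?)", rows)
    target.commit()
    return target.execute("SELECT COUNT(*) FROM qitems").fetchone()[0]


# Toy stand-in for the replica's `page` table (made-up rows, not real data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE page (page_title TEXT, page_namespace INTEGER)")
source.executemany(
    "INSERT INTO page VALUES (?, ?)",
    [("Q42", 0), ("Q64", 0), ("P31", 120), ("Q42", 1)],
)

target = sqlite3.connect(":memory:")
copied = copy_main_namespace_items(source, target)
```

With the real replica, `source` would be a MySQL connection instead, and the query placeholder style would depend on the driver used.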

Then I consult this database when parsing and skip the items that were not previously inserted. This way the parsing is shorter.
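
The lookup while parsing could look roughly like this. Again this is a sketch, not my exact code: it assumes the JSON dump layout with one entity per line inside a single array, and the `qitems` table, the `iter_known_entities` function and the sample lines are made up for the example.

```python
import json
import sqlite3


def iter_known_entities(lines, db):
    """Yield parsed entities whose id is present in the local `qitems` table."""
    for line in lines:
        # Each entity line ends with a comma; the array brackets sit on their own lines.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        found = db.execute(
            "SELECT 1 FROM qitems WHERE qitem = ?", (entity["id"],)
        ).fetchone()
        if found:
            yield entity


# Hypothetical local table of namespace 0 qitems, filled beforehand.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE qitems (qitem TEXT PRIMARY KEY)")
db.executemany("INSERT INTO qitems VALUES (?)", [("Q42",), ("Q64",)])

# Made-up dump excerpt in the assumed one-entity-per-line layout.
sample_dump = [
    "[",
    '{"id": "Q42", "sitelinks": {"enwiki": {"title": "Douglas Adams"}}},',
    '{"id": "P31", "sitelinks": {}},',
    '{"id": "Q64", "sitelinks": {}}',
    "]",
]
kept = [entity["id"] for entity in iter_known_entities(sample_dump, db)]
```

Note that this version still decodes every line to read its id, so the saving comes from skipping the later processing of the entities that are not in the table.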

Do you think there is any other way to do it?
Thanks.


TASK DETAIL
https://phabricator.wikimedia.org/T191639

To: marcmiquel
Cc: Addshore, Chicocvenancio, marcmiquel, Nandana, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, _jensen, gnosygnu, Wikidata-bugs, aude, Svick, Mbch331, jeremyb
_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs