[Wikidata-bugs] [Maniphest] [Commented On] T191639: Wikidata JSON dumps do not have the 'ns' (namespace)

marcmiquel Fri, 15 Feb 2019 11:09:32 -0800

marcmiquel added a comment.

I need all the Wikidata qitems that correspond to Wikipedia articles. If I understand correctly, these are qitems in namespace 0, although not every qitem in namespace 0 necessarily has sitelinks (some are qitems without an article).

The thing is that I'm not sure whether all Wikidata qitems are in the main namespace (0).

Let me explain what I did.

Since I cannot use the namespace XML tag in the dump to parse only namespace 0 and skip the rest, I turned to the Wikidata MySQL replica database.

There, I ran this query:
select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;

This is the result:

+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+

So it seems that there are many pages in namespaces 1198, 146, 2600...,
besides 3, 4, 2 and 1, which are user talk, project, user page and talk page.

I don't know how many of these are in the dump, but I only need those in namespace 0. So the solution I found is to retrieve all the qitems with namespace 0 from the Wikidata MySQL replica database and store them in a database.

Then I consult this database while parsing and skip the entities that were not previously inserted. This way the parsing is shorter.
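
A minimal sketch of this lookup-and-skip step, under some assumptions: the namespace-0 ids have already been fetched from the replica (here they are a hard-coded list), the local store is an in-memory SQLite table with a hypothetical name `ns0_items`, and the dump is read one entity per line, as in the Wikidata JSON dump layout (a JSON array whose lines end with a comma). The function names and the sample lines are illustrative only, not my actual code.

```python
import json
import sqlite3


def build_id_store(qids):
    """Store the namespace-0 ids in an in-memory SQLite table.

    The table name and schema are hypothetical; in practice the ids
    would come from the replica query on the page table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ns0_items (qid TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO ns0_items VALUES (?)",
        ((qid,) for qid in qids),
    )
    conn.commit()
    return conn


def iter_ns0_entities(lines, conn):
    """Yield only the entities whose id was previously inserted."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma;
        # the first and last lines are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        found = conn.execute(
            "SELECT 1 FROM ns0_items WHERE qid = ?", (entity["id"],)
        ).fetchone()
        if found is not None:
            yield entity


# Tiny made-up sample standing in for the real dump file.
sample_dump = [
    "[",
    '{"id": "Q42", "type": "item", "sitelinks": {"enwiki": {"title": "Douglas Adams"}}},',
    '{"id": "P31", "type": "property"},',
    '{"id": "Q1", "type": "item", "sitelinks": {}}',
    "]",
]

conn = build_id_store(["Q42", "Q1"])
kept = [entity["id"] for entity in iter_ns0_entities(sample_dump, conn)]
```

With the real dump, `lines` would be the opened (decompressed) dump file instead of the sample list, so the entities are still streamed one at a time.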

Do you think there is any other way to do it?
Thanks.

TASK DETAIL

https://phabricator.wikimedia.org/T191639

To: marcmiquel
Cc: Addshore, Chicocvenancio, marcmiquel, Nandana, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, _jensen, gnosygnu, Wikidata-bugs, aude, Svick, Mbch331, jeremyb

_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
