hoo added a comment.
In https://phabricator.wikimedia.org/T74678#2179243, @pere_prlpz wrote:

> I'm afraid there may still be some millions of duplicates in Wikidata json dump.
>
> According to its main page, there are 17.209.354 items in Wikidata. I downloaded a Wikidata entities json dump a few days ago and I expected to find the same number of items there. However, I counted about 20,568,190 lines, 20,565,957 of which are items (i.e. entities with an id starting with "Q"). I suppose this excess of about 3 million items is made up of duplicate items.
>
> Anyway, I haven't been able to find any actual duplicate, but it's hard to search for duplicates in a 69 GB file.
>
> If it matters, the dump I used was latest-all.json.bz2 from https://dumps.wikimedia.org/wikidatawiki/entities/ , dated March 28, 2016.

That is not a bug. The item count on the main page only counts items with at least one sitelink or at least one statement (or something along these lines), whereas the dump contains all items. Also, the number on the main page is always slightly off (due to caching and the way it is incremented/decremented).

TASK DETAIL
  https://phabricator.wikimedia.org/T74678
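
For anyone who wants to check this against a dump themselves, here is a rough sketch, not an official tool. It assumes the usual layout of the entities JSON dump (one entity per line inside a JSON array, each line ending in a comma) and it assumes that "at least one sitelink or at least one statement" is the relevant criterion, which may not match exactly what the main page counts. The function names are made up for this example.

```python
import bz2
import json


def count_items(lines):
    """Count items (ids starting with "Q") in an iterable of dump lines.

    Returns (total_items, items_with_content), where "content" means at
    least one statement or at least one sitelink. That criterion is an
    assumption; the main page statistic may be defined differently.
    """
    total = 0
    with_content = 0
    for raw in lines:
        # Each entity sits on its own line, followed by a comma.
        line = raw.strip().rstrip(",")
        # Skip blank lines and the array brackets on the first/last line.
        if not line or line in ("[", "]"):
            continue
        entity = json.loads(line)
        if not entity.get("id", "").startswith("Q"):
            continue  # properties and other entity types
        total += 1
        # Empty claims/sitelinks may be serialized as {} or []; both are falsy.
        if entity.get("claims") or entity.get("sitelinks"):
            with_content += 1
    return total, with_content


def count_items_in_dump(path):
    """Stream a .json.bz2 dump from disk without decompressing it first."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return count_items(fh)
```

Calling `count_items_in_dump("latest-all.json.bz2")` streams the file line by line, so memory use stays small, but on a dump of this size it will still take a long time. If the second number lands near the main page figure, the difference is made up of empty items rather than duplicates.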
