hoo added a comment.

  In https://phabricator.wikimedia.org/T74678#2179243, @pere_prlpz wrote:
  
  > I'm afraid there may still be some millions of duplicates in the Wikidata
  > JSON dump.
  >
  > According to its main page, there are 17,209,354 items in Wikidata. I
  > downloaded a Wikidata entities JSON dump a few days ago and expected to
  > find the same number of items there. However, I counted about 20,568,190
  > lines, 20,565,957 of which are items (i.e. entities with an id starting
  > with "Q"). I suppose this excess of about 3 million items consists of
  > duplicate items.
  >
  > Anyway, I haven't been able to find any actual duplicates, but it's hard
  > to search for duplicates in a 69 GB file.
  >
  > If it matters, the dump I used was latest-all.json.bz2 from
  > https://dumps.wikimedia.org/wikidatawiki/entities/ , dated March 28, 2016.
  
  
  That is not a bug: the item count on the main page only counts items with
  at least one sitelink or at least one statement (or something along these
  lines), whereas the dump contains all items. Also, the number on the main
  page is always slightly off (due to caching and the way it is
  incremented/decremented).
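
  For anyone who wants to check this against a dump, here is a rough sketch
  that streams the file once and reports the total number of items, the
  number of items with at least one sitelink or statement, and any item ID
  that occurs more than once. It assumes the usual dump layout (one entity
  per line inside a JSON array, with "id", "claims" and "sitelinks" keys).
  The function names are made up for this example, and the "non-empty"
  rule is only a guess at what the main page counts, not the actual logic.

```python
import bz2
import json
from collections import Counter


def iter_entities(lines):
    """Yield one parsed entity per dump line, skipping the array brackets."""
    for raw in lines:
        line = raw.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def summarize(entities):
    """Count items, 'non-empty' items, and item IDs seen more than once."""
    seen = Counter()
    items = 0
    non_empty = 0
    for entity in entities:
        entity_id = entity.get("id", "")
        if not entity_id.startswith("Q"):
            continue  # skip properties and anything else that is not an item
        items += 1
        seen[entity_id] += 1
        # Assumption: "non-empty" means at least one sitelink or statement.
        if entity.get("sitelinks") or entity.get("claims"):
            non_empty += 1
    duplicates = sorted(eid for eid, count in seen.items() if count > 1)
    return {"items": items, "non_empty": non_empty, "duplicates": duplicates}


def summarize_dump(path):
    """Stream a .json.bz2 dump from disk without decompressing it first."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return summarize(iter_entities(handle))
```

  Calling summarize_dump("latest-all.json.bz2") should give the three
  figures in one pass. Keeping every ID in memory is the simplest way to
  spot duplicates, but with some 20 million items it needs a fair amount of
  RAM, so treat this as a starting point rather than a tuned tool.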

TASK DETAIL
  https://phabricator.wikimedia.org/T74678

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel, hoo
Cc: pere_prlpz, Wikidata-bugs, scfc, ori, aude, Lydia_Pintscher, jeremyb-phone, 
daniel, Jefft0, hoo, Unknown Object (MLST), Lewizho99, D3r1ck01, Izno, Mbch331



_______________________________________________
Wikidata-bugs mailing list
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
