Smalyshev created this task.
Smalyshev added projects: Wikidata, Commons, SDC General, 
Wikidata-Query-Service, WikibaseMediaInfo, Dumps-Generation.

TASK DESCRIPTION
  I did a small test of RDF dump generation for SDC/mediainfo. Elasticsearch 
data shows that about 500k files on Commons have labels and about 850k files 
have statements (the two sets largely overlap). The way we dump entities right 
now, we scan all the files (page IDs) and skip those that do not have 
structured data. However, since only about 2% of files currently have data, 
this is a very wasteful process: we essentially process 50 to 100 pages to 
find one proper mediainfo entity. We may want to find a way to do better, 
though I am not sure the current classes allow it; we may have to implement a 
special class instead of SqlEntityIdPager.
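
  To make the cost difference concrete, here is a minimal Python sketch that 
contrasts scan-and-skip paging with paging over a precomputed list of IDs 
known to have data. It is a toy model only: the function names, the 10,000 
page range, and the exact 2% coverage are assumptions for illustration, and 
none of this reflects the actual Wikibase pager classes.

```python
def scan_and_skip(page_ids, has_data):
    """Visit every page ID and keep only those that have structured data.

    Returns (ids_found, pages_visited). Models the current approach.
    """
    visited = 0
    found = []
    for page_id in page_ids:
        visited += 1
        if has_data(page_id):
            found.append(page_id)
    return found, visited


def page_known_ids(ids_with_data):
    """Visit only IDs already known to have data (hypothetical alternative).

    Returns (ids_found, pages_visited).
    """
    found = sorted(ids_with_data)
    return found, len(found)


# Toy data: 10,000 pages, every 50th page has data (2% coverage, assumed).
all_pages = range(1, 10001)
with_data = {page_id for page_id in all_pages if page_id % 50 == 0}

found_scan, visited_scan = scan_and_skip(all_pages, with_data.__contains__)
found_direct, visited_direct = page_known_ids(with_data)

print("scan-and-skip visited:", visited_scan)
print("known-ID paging visited:", visited_direct)
print("pages visited per entity (scan):", visited_scan // len(found_scan))
```

  Both strategies return the same entities; only the number of pages touched 
differs. A real pager would need a cheap source for the list of IDs with 
data, which is exactly the open question in this task.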
  
  I tried dumping 100K mediainfo entities, and that took 166.5 minutes. On the 
one hand, given that we can parallelize, if we split the work into 8 shards we 
might be done in a reasonable time. On the other hand, an average of 10 items 
per second is too slow. If we expect the coverage of files with mediainfo to 
increase significantly (e.g. 10x or more), then it is maybe not that big of a 
deal, though T222497: dumpRDF for MediaInfo entities loads each page 
individually <https://phabricator.wikimedia.org/T222497> still remains a 
factor. As it is now, however, the RDF dumping process for mediainfo is very 
inefficient.
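
  The timing figures above can be turned into a rough estimate. The sketch 
below derives the per-second rate from the test run and extrapolates to a 
full dump. It assumes the measured rate stays constant, that the roughly 850k 
files with statements approximate the total number of entities to dump, and 
that 8 shards scale perfectly linearly; none of these assumptions has been 
verified.

```python
# Figures reported in the test run.
ENTITIES_DUMPED = 100_000
MINUTES_TAKEN = 166.5

# Entities per second observed in the test.
rate = ENTITIES_DUMPED / (MINUTES_TAKEN * 60)


def estimated_hours(entities, shards=1):
    """Estimate wall-clock hours, assuming constant rate and linear scaling."""
    return entities / rate / shards / 3600


# Assumed total: about 850k entities, taken from the Elasticsearch figure.
TOTAL_ENTITIES = 850_000

print(f"rate: {rate:.2f} entities/s")
print(f"1 shard:  {estimated_hours(TOTAL_ENTITIES):.1f} h")
print(f"8 shards: {estimated_hours(TOTAL_ENTITIES, shards=8):.1f} h")
```

  Under these assumptions a single process would need close to a day, while 
8 shards would bring the run down to about 3 hours.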

TASK DETAIL
  https://phabricator.wikimedia.org/T230856

To: Smalyshev
Cc: Lucas_Werkmeister_WMDE, Tpt, Addshore, Ramsey-WMF, Lydia_Pintscher, 
Aklapper, WMDE-leszek, ArielGlenn, hoo, Smalyshev, darthmon_wmde, DannyS712, 
Nandana, JKSTNK, Lahi, PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, 
SandraF_WMF, GoranSMilovanovic, Lunewa, QZanden, EBjune, Tramullas, Acer, 
merbst, LawExplorer, Salgo60, Silverfish, Poyekhali, _jensen, rosalieper, 
Morgankevinj, Taiwania_Justo, Jonas, Xmlizer, Susannaanas, Ixocactus, 
Wong128hk, gnosygnu, Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, 
matthiasmullie, aude, Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, 
Ricordisamoa, Wesalius, Fabrice_Florin, Raymond, Jdforrester-WMF, 
Steinsplitter, Mbch331, Keegan