aborrero added a comment.

I just had a videoconference with @abian and Ruben Ojeda from Wikimedia Spain.

Some conclusions:

  • IGN offers a lot of data, in many different formats. @abian or someone else should figure out how to post-process these files into a format that Commons understands.
  • We agreed to try a 200GB VM for processing the data before uploading to Commons, working in small chunks. Of these 200GB, 100GB is for the raw download and 100GB for the post-processed output awaiting upload to Commons. After a chunk is processed, the storage is cleaned to leave space for the next chunk.
  • Apparently IGN doesn't have an API or any other structured web URL that would let us download the data with a script. They use some custom POST parameters, and we would need some information about them before we can script the download.
  • If we can't automate the download, there is the option of going to the IGN datacenter, plugging in a hard disk and fetching all the data without using the network. Once we have this hard disk, we could either send it to a WMF datacenter or @abian could upload it from his home to our VM.
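
To illustrate the storage split, here is a rough sketch of how files could be grouped into chunks that fit the 100GB raw download half of the VM. Everything in it is an assumption for illustration: `plan_chunks` is a hypothetical helper, and the file names and sizes in the usage note are made up. It also does not check that the post-processed output fits in the other 100GB, since we don't know the output sizes yet.

```python
from typing import Dict, List

GB = 10**9
RAW_BUDGET = 100 * GB  # half of the 200GB VM, reserved for the raw download


def plan_chunks(sizes: Dict[str, int], budget: int = RAW_BUDGET) -> List[List[str]]:
    """Group files, in the given order, into chunks whose total size fits the budget."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in sizes.items():
        if size > budget:
            raise ValueError(f"{name} alone exceeds the raw download budget")
        if used + size > budget:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

For example, with made-up sizes of 60GB, 50GB, 40GB and 100GB for files "a" to "d", this would give three chunks: ["a"], ["b", "c"] and ["d"].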

So, there are 2 different issues here:

  • How to fetch the data from IGN (web API, HTTP POST, hard disk, etc.)
  • How to process the data we fetched from IGN
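
For the HTTP POST option, scripting a request could look roughly like the sketch below. We don't know the real IGN endpoint or its POST parameters yet, so the URL and the parameter names in the usage note are placeholders, and `build_download_request` is a hypothetical helper. It only builds the request and does not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: the real IGN download URL is still unknown.
DOWNLOAD_URL = "https://example.org/download"


def build_download_request(params: dict) -> Request:
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(params).encode("utf-8")
    return Request(
        DOWNLOAD_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Something like `build_download_request({"product": "x", "sheet": 12})` (made-up parameter names) could then be passed to `urllib.request.urlopen` and the response streamed to disk, once we know what the server actually expects.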

If we discover that IGN has an API (or @abian can script the HTTP POST easily), we could even think about building this pipeline on Toolforge, in our Grid Engine (download small chunk -> process -> upload to Commons -> start again).
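
The loop in parentheses could be sketched as follows. This is only an outline under assumptions: `run_pipeline` is a hypothetical name, and the `download`, `process` and `upload` callables are stand-ins for steps nobody has written yet. How it would be scheduled on Toolforge is left out.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_pipeline(
    chunks: Iterable[str],
    download: Callable[[str, Path], None],
    process: Callable[[Path, Path], None],
    upload: Callable[[Path], None],
    workdir: Path,
) -> int:
    """Handle one chunk at a time and free the disk before the next one."""
    done = 0
    for chunk in chunks:
        raw = Path(tempfile.mkdtemp(prefix="raw-", dir=str(workdir)))
        out = Path(tempfile.mkdtemp(prefix="out-", dir=str(workdir)))
        try:
            download(chunk, raw)  # fetch the raw files for this chunk
            process(raw, out)     # convert them to a format Commons understands
            upload(out)           # push the converted files to Commons
            done += 1
        finally:
            # Always clean up, even on failure, so the next chunk has free space.
            shutil.rmtree(raw, ignore_errors=True)
            shutil.rmtree(out, ignore_errors=True)
    return done
```

Keeping the three steps as plain callables means the fetch method (API, HTTP POST, or files copied from a hard disk) can change without touching the loop itself.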


TASK DETAIL
https://phabricator.wikimedia.org/T195121
