Hi,

Regarding providing dumps of images: it's much more complex than providing dumps of text. A couple of challenges that I can think of off the top of my head:

- Size: The largest text dump I could find is around 158GB. In comparison, images add up to 0.5PB in total. That is roughly a 3000-fold jump. Just storing the dump in a way that lets us serve it to the public would be a massive technical challenge. Downloading the dump can also put the same strain on the network as aggressive scraping, so it would basically just move the problem around.

- Illegal content: While we can publish the text dump of Wikipedia and let people download it freely, we can't do the same with images. If even a single illegal image (CSAM) gets reported, we need to immediately remove that dump (slice) altogether and redo it without the reported image(s). That is really challenging in technical terms, since the images will be bundled in one zipped file. We should probably also register people who download the dumps so that they can be notified to re-download certain slices. It's not just a technical challenge but also a legally complex problem, given the different laws about this across the globe. (Disclaimer: I'm not a lawyer.)

- Originals vs. thumbnails: As I said, the originals add up to 0.5PB, but most people don't need the originals. In fact, they are usually too large to be useful. Providing a dump of the thumbnails could be more useful, but it comes with its own challenges (for example, we currently have 2B thumbnails adding up to ~250TB).
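For a rough sense of scale, here is a back-of-envelope sketch using only the figures quoted in this email (158GB, 0.5PB, 2B thumbnails at ~250TB, 100M images). Decimal units are an assumption on my side, and the 1 Gbit/s link in the last step is a purely hypothetical number picked for illustration, not a measurement of anything.

```python
# Back-of-envelope arithmetic on the sizes quoted in this email.
# Assumption: decimal units (1 PB = 1000 TB = 1,000,000 GB).
GB = 10**9
TB = 10**12
PB = 10**15

text_dump_bytes = 158 * GB       # largest text dump found
originals_bytes = PB // 2        # 0.5PB of original images
thumbnails_bytes = 250 * TB      # ~250TB of thumbnails
thumbnail_count = 2 * 10**9      # 2B thumbnails
original_count = 100 * 10**6     # 100M images in total


def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move size_bytes over a fully saturated link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400


# How much bigger the originals are than the largest text dump.
size_ratio = originals_bytes / text_dump_bytes

# Average size per file, derived from the totals above.
avg_thumbnail_kb = thumbnails_bytes / thumbnail_count / 1000
avg_original_mb = originals_bytes / original_count / 10**6

# Hypothetical: one downloader on a saturated 1 Gbit/s link.
days_at_1gbit = transfer_days(originals_bytes, 10**9)

print(f"originals vs. text dump: ~{size_ratio:.0f}x")
print(f"average thumbnail: ~{avg_thumbnail_kb:.0f} KB")
print(f"average original: ~{avg_original_mb:.0f} MB")
print(f"0.5PB at 1 Gbit/s: ~{days_at_1gbit:.0f} days")
```

So a single full download of the originals would keep such a link busy for well over a month, which is why serving the dump could strain the network about as much as the scraping it is meant to replace.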
None of these are impossible to solve, but they are also not trivial. It's not as easy as setting up an Airflow DAG to just dump them somewhere. To make the problem more manageable, we can consider narrowing down the scope: when people say they want a dump of images, what they mean most of the time is a dump of all images that are used in Wikipedia articles (which is a much smaller subset of the 100M images we have). The good news is that Kiwix already provides those in its offline version of Wikipedia. They are not the originals, but usually the biggest possible thumbnail size (given various conditions). Maybe that'd be useful or "good enough" for most people?

_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/