Hi,
Regarding providing dumps of images: it's much more complex than providing 
dumps of text. A couple of challenges that I can think of off the top of my 
head:
 - Size: The largest text dump I could find is around 158GB. In comparison, 
images in total add up to 0.5PB. That is a jump of roughly 3000 times. Just 
storing the dump in a way that lets us serve it to the public would be a 
massive technical challenge. Downloading the dump could also put the same 
strain on the network as aggressive scraping, so it would basically just move 
the problem around.
 - Illegal content: While we can put the text dump of Wikipedia out in public 
and allow people to download it freely, we can't do the same with images. If 
even a single illegal image (CSAM) gets reported, we need to immediately 
remove that dump (slice) altogether and redo it without the reported image(s). 
That is really challenging in technical terms, since the images would be 
bundled in one zipped file. We should probably also register people who 
download the dumps so they can be notified to re-download certain slices. It's 
not just a technical challenge but also a legally complex problem, given the 
different laws about this across the globe. (Disclaimer: I'm not a lawyer.)
 - Originals vs. thumbnails: As I said, the originals add up to 0.5PB, but 
most people don't need the originals. In fact, they are usually too large to be 
useful. Providing a dump of the thumbnails could be more useful, but it comes 
with its own challenges (for example, we currently have 2B thumbnails adding 
up to ~250TB).
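
For a rough sense of scale, here is a back-of-the-envelope check of the numbers 
above. It is only a sketch: it assumes decimal units (1PB = 10^15 bytes), and 
the 1 Gbit/s link speed is a hypothetical downloader's connection, not a figure 
from our infrastructure.

```python
# Back-of-the-envelope check of the figures quoted above (decimal units).
GB = 10**9
TB = 10**12
PB = 10**15

text_dump_bytes = 158 * GB   # largest text dump mentioned above
originals_bytes = 0.5 * PB   # all original images
thumbs_bytes = 250 * TB      # all thumbnails (approximate)
thumb_count = 2 * 10**9      # number of thumbnails

# How many times larger the originals are than the text dump.
size_ratio = originals_bytes / text_dump_bytes

# Average size of one thumbnail, in kilobytes.
avg_thumb_kb = thumbs_bytes / thumb_count / 1000

# ASSUMPTION: a hypothetical downloader with a sustained 1 Gbit/s link.
ASSUMED_LINK_BITS_PER_SEC = 10**9
days_to_download = originals_bytes * 8 / ASSUMED_LINK_BITS_PER_SEC / 86400

print(f"originals vs. text dump: ~{size_ratio:.0f}x")
print(f"average thumbnail size: ~{avg_thumb_kb:.0f} KB")
print(f"days to fetch all originals at 1 Gbit/s: ~{days_to_download:.0f}")
```

Under those assumptions the ratio comes out a bit above 3000, the average 
thumbnail is about 125 KB, and a single full download of the originals would 
keep a 1 Gbit/s link busy for more than six weeks.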

None of these are impossible to solve, but they are also not trivial. It's not 
as easy as setting up an Airflow DAG to just dump them somewhere.
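
To illustrate the slice idea from the illegal-content point, here is a minimal 
sketch of one way images could be assigned to slices so that a report only 
forces a rebuild of the affected slice(s). Everything in it is hypothetical: 
the slice count, the function names, and the choice of hashing the file name 
are assumptions for illustration, not a description of any existing dump 
pipeline.

```python
import hashlib

# ASSUMPTION: hypothetical number of slices the dump is split into.
NUM_SLICES = 1024


def slice_for(filename: str, num_slices: int = NUM_SLICES) -> int:
    """Map a file name to a stable slice index via a hash of the name."""
    digest = hashlib.sha256(filename.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_slices


def slices_to_rebuild(reported: list[str], num_slices: int = NUM_SLICES) -> set[int]:
    """Return the slice indexes that contain at least one reported file."""
    return {slice_for(name, num_slices) for name in reported}


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    reported = ["Example_a.jpg", "Example_b.png"]
    print(sorted(slices_to_rebuild(reported)))
```

With a stable mapping like this, a report only touches the slices returned by 
`slices_to_rebuild`, and the same indexes could be used to tell registered 
downloaders which slices to fetch again. It does nothing for the legal side of 
the problem.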

To make the problem more manageable, we can consider narrowing the scope: 
when people say they want a dump of images, what they mean most of the time is 
a dump of all images that are used in Wikipedia articles (which is a much 
smaller subset of the 100M images we have). The good news is that Kiwix already 
provides those in its offline version of Wikipedia. They are not the originals, 
but usually the biggest possible thumbnail size (given various conditions). 
Maybe that'd be useful or "good enough" for most people?
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
