https://bugzilla.wikimedia.org/show_bug.cgi?id=42466
George Orwell III <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #20 from George Orwell III <[email protected]> ---
Let's back up a bit before my head explodes....

First - a DjVu is nothing more than a glorified zip file that is archiving a bunch of other stand-alone [indirect] djvu files, with an Index "file" within it directing the order viewed, any annotations, embedded hyperlinks, shared dictionaries, typical metadata, coordinate mappings of text-layers, images, etc. etc. for all the DjVus within it as a single [bundled] DjVu file. The premise behind the DjVu file format is largely mirrored by the Index: and Page: namespaces on Wikisource today.

Why it was treated like an image file rather than an archive file from day one around here I'll never quite understand (I can peek at a single .jpg or .txt file compacted within a .zip file without having to extract/deflate the entire .zip archive to do it & it doesn't re-classify the .zip file as a pic or a doc file just because I can... So???...WtF???.... but I digress).

The point I'm trying to make is DjVus were never meant to be anything more than a quick and easy, compact alternative to PDF files (a hack). THAT is why there will always be issues ....

https://bugzilla.wikimedia.org/show_bug.cgi?id=8263#c3
https://bugzilla.wikimedia.org/show_bug.cgi?id=9327#c4
https://bugzilla.wikimedia.org/show_bug.cgi?id=24824#c10
https://bugzilla.wikimedia.org/show_bug.cgi?id=30751#c3
https://bugzilla.wikimedia.org/show_bug.cgi?id=21526#c16
https://bugzilla.wikimedia.org/show_bug.cgi?id=28146#c4
https://bugzilla.wikimedia.org/show_bug.cgi?id=30906#c0

<<< and I'm sure there are more; it's my 1st day; sorry >>>

... with the current "plain text dump" approach over the never fully developed extract & parse approach. An XML of the text-layer generated via OCR is how Archive.org does it & that is how we should be doing it too.
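To make the "extract & parse" idea concrete, here is a rough sketch of what reading an XML text-layer could look like. It is only an illustration: the element names (OBJECT, HIDDENTEXT, LINE, WORD with a coords attribute) are assumed to follow the DjVuXML-style layout, the sample page is made up, and nothing here is taken from DjVuImage.php. The meaning and order of the four coords values is not interpreted; they are just read as integers.

```python
import xml.etree.ElementTree as ET

# Made-up sample page in an assumed DjVuXML-style layout.
# Real files would carry more attributes and more pages.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT data="book.djvu" type="image/x.djvu" height="3300" width="2550">
<PARAM name="PAGE" value="book_0001.djvu"/>
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE>
<WORD coords="300,420,460,380">Let's</WORD>
<WORD coords="480,420,600,380">back</WORD>
<WORD coords="620,420,680,380">up</WORD>
</LINE>
<LINE>
<WORD coords="300,480,350,440">a</WORD>
<WORD coords="370,480,450,440">bit</WORD>
</LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
</BODY>
</DjVuXML>"""


def page_lines(xml_text):
    """Return one list of text lines per OBJECT (assumed: one OBJECT per page)."""
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):
        lines = []
        for line in obj.iter("LINE"):
            words = [(w.text or "").strip() for w in line.iter("WORD")]
            lines.append(" ".join(w for w in words if w))
        pages.append(lines)
    return pages


def word_boxes(xml_text):
    """Return (word, coords) pairs; coords are kept as raw integer tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for w in root.iter("WORD"):
        raw = w.get("coords", "")
        coords = tuple(int(c.strip()) for c in raw.split(",") if c.strip())
        boxes.append(((w.text or "").strip(), coords))
    return boxes


if __name__ == "__main__":
    for number, lines in enumerate(page_lines(SAMPLE), start=1):
        print("page", number)
        for text in lines:
            print("  ", text)
```

With the words and their boxes held separately like this, the text could be corrected on the wiki side and written back later without touching the image layers.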
Once the text is in XML form, you can wipe it from the DjVu file on Commons (leaving nothing but the image layers to pull thumbnails from) until, at the very least, it's fixed up by the Wikisource/Wikibooks people, if not just by bot, for reinsertion if need be.

Someone needs to revisit DjVuImage.php and finish off the extract & convert/parse to/from XML development portion [DjVuLibre?] that was abandoned or whatever because "it was too slow" 6 years ago. The current bloated text dump will still be there to fall back on.

--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
