[Bug 42466] Text layer of DjVu files doesn't appear in Page namespace due to higher memory consumption after upgrade to Ubuntu 12.04

bugzilla-daemon Mon, 22 Apr 2013 10:14:40 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=42466


George Orwell III <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]

--- Comment #20 from George Orwell III <[email protected]> ---
Let's back up a bit before my head explodes....

First - a DjVu is nothing more than a glorified zip file that archives a
bunch of other stand-alone [indirect] djvu files, with an Index "file" within
directing the viewing order, any annotations, embedded hyperlinks, shared
dictionaries, typical metadata, coordinate mappings of text-layers, images,
etc. for all the DjVus within it as a single [bundled] DjVu file. The
premise behind the DjVu file format is largely mirrored by the Index: and Page:
namespaces on Wikisource today.

Why it was treated like an image file rather than an archive file from day one
around here I'll never quite understand (I can peek at a single .jpg or .txt
file compacted within a .zip file without having to extract/deflate the entire
.zip archive to do it & it doesn't re-classify the .zip file as a pic or a doc
file just because I can... So???...WtF???.... but I digress).

The point I'm trying to make is DjVus were never meant to be anything more than
a quick and easy, compact alternative to PDF files (a hack). THAT is why there
will always be issues ....

https://bugzilla.wikimedia.org/show_bug.cgi?id=8263#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=9327#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=24824#c10

https://bugzilla.wikimedia.org/show_bug.cgi?id=30751#c3

https://bugzilla.wikimedia.org/show_bug.cgi?id=21526#c16

https://bugzilla.wikimedia.org/show_bug.cgi?id=28146#c4

https://bugzilla.wikimedia.org/show_bug.cgi?id=30906#c0

<<< and I'm sure there are more; it's my 1st day; sorry >>>

... with the current "plain text dump" approach over the never fully developed
extract & parse approach. An XML of the text-layer generated via OCR is how
Archive.org does it & that is how we should be doing it too. Once the text is
in XML form - you can wipe it from the DjVu file on Commons (leaving nothing
but the image layers to pull thumbnails from) until at the very least it's fixed
up by the Wikisource/WikiBooks people if not just by BOT for reinsertion if
need be.

Someone needs to revisit DjVuImage.php and finish off the extract &
convert/parse to/from XML development portion [DjVuLibre?] that was abandoned
or whatever because "it was too slow" 6 years ago. The current bloated text dump
will still be there to fall back on.
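
To make the extract & parse idea a bit more concrete, here is a rough Python
sketch that pulls per-page text out of XML shaped like what DjVuLibre's
djvutoxml emits (OBJECT per page, WORD elements inside HIDDENTEXT). The SAMPLE
string, the page names and the pages_from_djvuxml helper are made up for
illustration; this is not what DjVuImage.php does, and real output would need
more care (DOCTYPE, odd characters from OCR, etc.).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like DjVuXML; file name, page names and words
# are invented for this sketch.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT data="file://book.djvu" type="image/x.djvu" height="100" width="80">
<PARAM name="PAGE" value="p0001.djvu"/>
<HIDDENTEXT>
<PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="10,20,30,10">Hello</WORD> <WORD coords="35,20,60,10">world</WORD></LINE>
<LINE><WORD coords="10,40,30,30">second</WORD> <WORD coords="35,40,60,30">line</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN>
</HIDDENTEXT>
</OBJECT>
<OBJECT data="file://book.djvu" type="image/x.djvu" height="100" width="80">
<PARAM name="PAGE" value="p0002.djvu"/>
</OBJECT>
</BODY>
</DjVuXML>"""


def pages_from_djvuxml(xml_text):
    """Return {page name: plain text}, one text line per LINE element."""
    root = ET.fromstring(xml_text)
    pages = {}
    for obj in root.iter("OBJECT"):
        # The page name is carried in <PARAM name="PAGE" value="..."/>.
        name = None
        for param in obj.iter("PARAM"):
            if param.get("name") == "PAGE":
                name = param.get("value")
        # Join the words of each line; a page without a text layer yields "".
        lines = [
            " ".join(word.text or "" for word in line.iter("WORD"))
            for line in obj.iter("LINE")
        ]
        pages[name] = "\n".join(lines)
    return pages


if __name__ == "__main__":
    for page, text in pages_from_djvuxml(SAMPLE).items():
        print(page, repr(text))
```

Keyed by page like this, each Page: in the namespace would only need its own
slice of text rather than the whole dump for the file.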

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
