Re: [Wikitech-l] Full support for djvu file

Alex Brollo Thu, 03 May 2012 06:23:06 -0700

>
>
> Text layer is stored in img_metadata, which means it can be retrieved
> by the API (using ?action=query&prop=imageinfo&iiprop=metadata).
> However when I tried to test this, it didn't seem to work. Maybe
> trying to return the entire text layer hit some max api result size
> limit or something. (It'd be really nice if we had some nicer place to
> store information about files, especially for huge things like the
> text layer which we don't generally want to load the entire thing all
> the time. There's a bug about that somewhere in bugzilla land).
>
> Indirect mode (From what I can find out from google) is when you have
> an index djvu file that has links to all the pages making up the djvu
> file, so you can start viewing immediately and pages are only
> downloaded as needed. I'm not sure how such a format would work in
> terms of uploading it. Unless we convert it on the server side, how
> would we upload all the constituent files (I suppose we could tell
> people to upload tarballs. Then we have to make sure to validate the
> contents, and communicate to people that the tarball is only for
> uploaded djvu files). [Of course until 5 minutes ago I'd never heard
> of an indirect djvu file, so I could be misunderstanding]
>
> -bawolff
>


I use the DjVuLibre library a lot on my PC, both from the console and from
Python scripts, so I can tell you that it is very simple to convert a
"bundled" djvu file into an "indirect" one. Obviously this should be
transparent to the uploader, being a fully automatic server-side job.
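
A minimal sketch of that conversion, assuming the DjVuLibre tool djvmcvt is
installed and on the PATH. The function names, the output directory, and
the index file name are only illustrative choices, not anything MediaWiki
does today:

```python
import subprocess
from pathlib import Path


def build_indirect_cmd(bundled, out_dir, index_name="index.djvu"):
    """Return the djvmcvt command line that splits a bundled file.

    djvmcvt -i <bundled.djvu> <directory> <index.djvu> writes one
    component file per page into <directory>, plus a small index file.
    """
    return ["djvmcvt", "-i", str(bundled), str(out_dir), index_name]


def convert_to_indirect(bundled, out_dir, index_name="index.djvu"):
    """Run the conversion and return the path of the index file."""
    # Creating the directory first is a precaution (assumption: harmless
    # if djvmcvt would have created it anyway).
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_indirect_cmd(bundled, out_dir, index_name), check=True)
    return Path(out_dir) / index_name
```

Something like convert_to_indirect("book.djvu", "book_pages") could run as
a server-side job after upload, so the uploader never sees the difference.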

About the text layer: it's very, very interesting, even if complex. There
are command-line DjVuLibre routines to do anything you want, both to read
and to edit it. What we get now is simply the most banal output (the full
text); from any IA djvu file you can get much more, i.e. the hierarchical
text structure (at page, column, region, paragraph, line, and single-word
level) with the coordinates of every element at every level of detail. You
can also read and insert structured metadata, both as "global metadata" and
as page-specific metadata.
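
To make the structure concrete, here is a rough sketch of reading one
page's text layer with djvused and parsing its output. The "print-txt"
command prints nested expressions of the form
(kind xmin ymin xmax ymax children-or-"text"). The Zone class and the
function names are mine, and the string handling is simplified: backslash
escapes inside quoted text are left undecoded.

```python
import re
import subprocess
from dataclasses import dataclass, field

# Tokens: parentheses, double-quoted strings (with backslash escapes), atoms.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


@dataclass
class Zone:
    kind: str            # page, column, region, para, line, word or char
    bbox: tuple          # (xmin, ymin, xmax, ymax) in page pixels
    text: str = ""
    children: list = field(default_factory=list)


def _parse(tokens, pos):
    """Parse one '(kind x0 y0 x1 y1 ...)' expression starting at pos."""
    assert tokens[pos] == "("
    zone = Zone(tokens[pos + 1], tuple(int(t) for t in tokens[pos + 2:pos + 6]))
    pos += 6
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _parse(tokens, pos)
            zone.children.append(child)
        else:
            zone.text += tokens[pos][1:-1]   # strip quotes, keep escapes raw
            pos += 1
    return zone, pos + 1


def parse_text_layer(sexpr):
    """Turn the output of 'print-txt' into a tree of Zone objects."""
    zone, _ = _parse(TOKEN.findall(sexpr), 0)
    return zone


def words(zone):
    """Yield every word-level zone below (or at) the given zone."""
    if zone.kind == "word":
        yield zone
    for child in zone.children:
        yield from words(child)


def read_page_text_layer(djvu_path, page):
    """Ask djvused for the structured text of one page (1-based)."""
    script = f"select {page}; print-txt"
    out = subprocess.run(["djvused", "-e", script, str(djvu_path)],
                         check=True, capture_output=True, text=True).stdout
    return parse_text_layer(out)
```

With this, [w.text for w in words(read_page_text_layer("book.djvu", 3))]
would give the words of page 3, each still carrying its bounding box.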

Any djvu extraction/editing function runs on both bundled and indirect
djvu files, and obviously any read/edit is much faster when a small,
single-page file is addressed.

The coordinates of text elements and the hierarchical structure of the
text are extremely interesting, since such a set of data could be used to
"guess formatting": e.g. you could "guess" centered text, tables, section
alignment, headers/footers, poems, paragraphs, and font sizes too.
Inter-line spacing could be used to "guess" chapter titles. "Empty text
areas" are often simply areas covered by illustrations, so an intelligent
algorithm could guess their size and position.
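
Just to illustrate the kind of guessing I mean, a toy sketch working on
bare bounding boxes (xmin, ymin, xmax, ymax). It assumes the DjVu
convention that the origin is at the bottom-left of the page, so larger y
means higher on the page. The thresholds are arbitrary guesses and would
need tuning on real books:

```python
from statistics import median


def is_centered(line, page, tol=0.03, min_margin=0.10):
    """Guess whether a line is centered on the page.

    A line counts as centered when its left and right margins are nearly
    equal (within tol, as a fraction of page width) and both are wide
    enough that it is not just a full-width body line.
    """
    px0, _, px1, _ = page
    lx0, _, lx1, _ = line
    width = px1 - px0
    left = (lx0 - px0) / width
    right = (px1 - lx1) / width
    return abs(left - right) <= tol and min(left, right) >= min_margin


def lines_after_large_gap(lines, factor=1.8):
    """Return lines preceded by unusually large vertical white space.

    Such lines are candidates for chapter titles or section starts.
    """
    ordered = sorted(lines, key=lambda b: b[3], reverse=True)  # top first
    # Gap = bottom of the previous line minus top of the current line.
    gaps = [prev[1] - cur[3] for prev, cur in zip(ordered, ordered[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    if typical <= 0:
        return []
    return [cur for gap, cur in zip(gaps, ordered[1:]) if gap > factor * typical]
```

A line that is both centered and preceded by a large gap would be a
reasonable first candidate for a chapter title.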

I imagine that thumbnail generation and purging would also be much faster
and more efficient.

In brief, we have a Ferrari but are using it with a speed limit of 10
miles/hour. :-)

Alex
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
