> Text layer is stored in img_metadata, which means it can be retrieved
> by the API (using ?action=query&prop=imageinfo&iiprop=metadata).
> However when I tried to test this, it didn't seem to work. Maybe
> trying to return the entire text layer hit some max api result size
> limit or something. (It'd be really nice if we had some nicer place to
> store information about files, especially for huge things like the
> text layer which we don't generally want to load the entire thing all
> the time. There's a bug about that somewhere in bugzilla land).
>
> Indirect mode (From what I can find out from google) is when you have
> an index djvu file that has links to all the pages making up the djvu
> file, so you can start viewing immediately and pages are only
> downloaded as needed. I'm not sure how such a format would work in
> terms of uploading it. Unless we convert it on the server side, how
> would we upload all the constituent files (I suppose we could tell
> people to upload tarballs. Then we have to make sure to validate the
> contents, and communicate to people that the tarball is only for
> uploaded djvu files). [Of course until 5 minutes ago I'd never heard
> of an indirect djvu file, so I could be misunderstanding]
>
> -bawolff
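For anyone who wants to retry the API query quoted above, here is a minimal sketch of how the request URL could be built. The parameters action=query, prop=imageinfo and iiprop=metadata come from the quoted message; the endpoint (Commons), the file title and format=json are assumptions added only for illustration. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; any MediaWiki api.php URL could be used.
API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def metadata_query_url(file_title, endpoint=API_ENDPOINT):
    """Build an imageinfo query URL asking for the stored file metadata."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": file_title,   # hypothetical title, e.g. "File:Example.djvu"
        "format": "json",       # assumed output format
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(metadata_query_url("File:Example.djvu"))
```

If the response really is cut off by a result size limit, as suspected above, the URL itself would still be valid; the sketch says nothing about what the server returns.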
I use the DjVuLibre library a lot on my PC, both from the console and from Python scripts, so I can tell you that converting a "bundled" DjVu file into an "indirect" one is very simple. Obviously this should be transparent to the uploader: it would be a fully automatic job on the server.

About the text layer: it is very, very interesting, even if complex. DjVuLibre has command-line routines to do anything you want with it, both reading and editing. What we get now is simply the most banal output (the full text). From any IA DjVu file you can get much more, i.e. the hierarchical text structure (at page, column, region, paragraph, line and single-word level) with the coordinates of every element at every level of detail. You can also read and insert structured metadata, both as "global" metadata and as page-specific metadata. Every DjVu extraction/editing function runs on both bundled and indirect DjVu files, and obviously any read or edit is much faster when it addresses a small, single-page file.

The coordinates of the text elements and the hierarchical structure of the text are extremely interesting, since this data could be used to "guess" formatting: centered text, tables, section alignment, headers/footers, poems, paragraphs, and font sizes too. Inter-line spacing could be used to "guess" chapter titles. "Empty" text areas are often simply areas covered by illustrations, so an intelligent algorithm could guess their size and position.

I imagine that thumbnail generation/purging would also be much faster and more effective.

In brief, we have a Ferrari but are driving it with a speed limit of 10 miles/hour. :-)

Alex
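P.S. To make the two DjVuLibre points a bit more concrete, here is a rough Python sketch, not a tested server-side implementation. It assumes the DjVuLibre tools djvmcvt (bundled to indirect conversion) and djvused (its print-txt command dumps the hidden text as nested elements with xmin ymin xmax ymax coordinates) are installed. The file names, the sample dump, the tolerance value and the centering heuristic are all hypothetical choices made for illustration.

```python
import re
import subprocess

# Opening of a "(line xmin ymin xmax ymax ..." element in djvused print-txt
# output. A regex ignores nesting and could be fooled by odd word text; a
# real tool should parse the full S-expression.
LINE_RE = re.compile(r'\(line\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')


def to_indirect_cmd(bundled, out_dir, index_name="index.djvu"):
    """Command list for djvmcvt: bundled file -> indirect document."""
    return ["djvmcvt", "-i", bundled, out_dir, index_name]


def print_txt(path, page):
    """Run djvused and return the structured text dump of one page."""
    cmd = ["djvused", path, "-e", f"select {page}; print-txt"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def line_boxes(dump):
    """Extract (xmin, ymin, xmax, ymax) for every line element in a dump."""
    return [tuple(int(n) for n in m.groups()) for m in LINE_RE.finditer(dump)]


def guess_centered(boxes, tol=20):
    """Illustrative heuristic: a line is 'centered' if it is indented on both
    sides of the text block by roughly the same amount (within tol units)."""
    block_xmin = min(b[0] for b in boxes)
    block_xmax = max(b[2] for b in boxes)
    result = []
    for xmin, _, xmax, _ in boxes:
        left, right = xmin - block_xmin, block_xmax - xmax
        result.append(left > tol and right > tol and abs(left - right) <= tol)
    return result


# Hand-written sample in the shape of a print-txt dump (not real OCR data).
SAMPLE = '''(page 0 0 2000 3000
 (line 800 2800 1200 2860 (word 800 2800 1200 2860 "CHAPTER"))
 (line 200 2600 1800 2660 (word 200 2600 900 2660 "Full") (word 950 2600 1800 2660 "width"))
 (line 200 2500 1100 2560 (word 200 2500 1100 2560 "short")))'''

if __name__ == "__main__":
    print(to_indirect_cmd("book.djvu", "book_indirect"))
    print(guess_centered(line_boxes(SAMPLE)))
```

On the sample, only the first line is flagged as centered, because it is the only one indented equally on both sides. The same line boxes could feed similar guesses for headers, footers or inter-line spacing.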