2014-10-01 9:18 GMT+02:00 Jane Darnell <[email protected]>:

> Actually, I would rather have a tool that pulls apart djvu files as they
> are uploaded; keeping the text in WS and the pics in Commons
>


This is very interesting, since abbyy.xml files contain both a full,
character-by-character description of text position and format, and the
coordinates of any non-textual content (illustrations) on the scanned page.
Using such data appropriately, it would be possible to extract
illustrations and other graphical elements from pages automatically.
Nevertheless, I have seen that such "self-cropping" of illustrations
sometimes fails, and is often confused by unusual illustration or
graphical-element layouts, so that many "illustrations" are nonsense or
have to be cropped again. Unfortunately, djvu files contain no such
"illustration coordinates".
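
To make the idea concrete, here is a rough sketch of how the picture
coordinates could be read from an abbyy.xml file. It is only an
illustration under assumptions: I assume the usual FineReader layout, where
each page holds block elements carrying a blockType attribute and l, t, r, b
pixel coordinates. The picture_boxes function and the SAMPLE data are
made up for this example, and real files may differ.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample imitating the assumed abbyy.xml layout.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
  <page width="2000" height="3000" resolution="300">
    <block blockType="Text" l="100" t="100" r="1900" b="900"/>
    <block blockType="Picture" l="300" t="1000" r="1700" b="2400"/>
  </page>
  <page width="2000" height="3000" resolution="300">
    <block blockType="Picture" l="50" t="60" r="400" b="500"/>
  </page>
</document>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def picture_boxes(abbyy_xml):
    """Return (page_number, left, top, right, bottom) for each picture block.

    Pages are numbered from 1 in document order. Coordinates are taken
    unchanged from the l, t, r, b attributes (assumed to be pixels).
    """
    root = ET.fromstring(abbyy_xml)
    boxes = []
    page_no = 0
    # iter() walks the tree in document order, so each page element is
    # seen before the blocks it contains.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "page":
            page_no += 1
        elif name == "block" and elem.get("blockType") == "Picture":
            boxes.append((
                page_no,
                int(elem.get("l")),
                int(elem.get("t")),
                int(elem.get("r")),
                int(elem.get("b")),
            ))
    return boxes


if __name__ == "__main__":
    for box in picture_boxes(SAMPLE):
        print(box)
```

Each box could then be used to crop the matching page image. Given the
failures mentioned above, a human check of every crop would still be needed.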


Alex
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
