I have seen many messy text-image mixes on Google Books, especially in older
texts from the manual typesetting days. That's why I was wondering whether it
would be possible to have a tool that stores pages as you go, so you can step
in and adjust them on a per-page basis. I am not familiar with abbyy.xml files,
but they may be the way to go.

On Wed, Oct 1, 2014 at 2:18 PM, Alex Brollo <[email protected]> wrote:

> 2014-10-01 9:18 GMT+02:00 Jane Darnell <[email protected]>:
>
>> Actually, I would rather have a tool that pulls apart djvu files as they
>> are uploaded, keeping the text in WS and the pics in Commons.
>>
>
>
> This is very interesting, since abbyy.xml files contain both a fully
> detailed (character-by-character) mapping of the text and its formatting,
> and the coordinates of any non-textual content (illustrations) on the
> scanned page. Using such data appropriately, it would be possible to
> extract illustrations and other graphical elements of pages automatically.
> Nevertheless, I have seen that such "self-cropping" of illustrations
> sometimes fails and is often confused by unusual formats of illustrations
> or graphical elements, so that many "illustrations" are nonsense or have to
> be cropped again. Unfortunately, djvu files have no such "illustration
> coordinates" inside.
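
If the files are laid out the way Alex describes, collecting the illustration
coordinates might look roughly like the sketch below. It is only a sketch
under assumptions: the element and attribute names (page, block,
blockType="Picture", and the l/t/r/b box edges) are my guess at how the ABBYY
XML export marks up a page, and the namespace in the sample is a made-up
placeholder. None of this is confirmed in this thread. It only lists candidate
boxes per page; given the cropping failures mentioned above, each box would
still need a human check.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix so matching works whatever namespace is used."""
    return tag.rsplit("}", 1)[-1]


def picture_boxes(xml_text):
    """Return (page_number, left, top, right, bottom) for each picture block.

    ASSUMPTION: pages are <page> elements and illustrations are <block>
    elements with blockType="Picture" and integer l/t/r/b attributes.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    page_no = 0
    # iter() walks in document order, so a <page> is seen before its blocks.
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "page":
            page_no += 1
        elif name == "block" and elem.get("blockType") == "Picture":
            box = tuple(int(elem.get(edge)) for edge in ("l", "t", "r", "b"))
            boxes.append((page_no,) + box)
    return boxes


# Hypothetical two-page sample; the namespace is a placeholder, not the real one.
SAMPLE = """<document xmlns="urn:example:abbyy">
  <page width="1000" height="1500">
    <block blockType="Text" l="100" t="100" r="900" b="400"/>
    <block blockType="Picture" l="150" t="450" r="850" b="1200"/>
  </page>
  <page width="1000" height="1500">
    <block blockType="Text" l="100" t="100" r="900" b="1400"/>
  </page>
</document>"""

if __name__ == "__main__":
    for page, left, top, right, bottom in picture_boxes(SAMPLE):
        print(f"page {page}: crop box ({left}, {top}) to ({right}, {bottom})")
```

A per-page list like this could feed the "step in and adjust" workflow: the
tool proposes a crop box, and the uploader accepts or corrects it before the
image goes to Commons.
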
>
>
> Alex
>
> _______________________________________________
> Wikisource-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
>
>
