2014-10-01 15:15 GMT+02:00 Jane Darnell <[email protected]>:

> I have seen many messy text-image mixes on Google Books, especially older
> texts from manual typesetting days.  That's why I was wondering if it would
> be possible to have a tool that stores pages as you go, so you can step in
> and adjust it on a per page basis. I am not familiar with abbyy.xml files,
> but this may be the way to go.
>

I burned out a few million neurons while attempting to parse abbyy xml
files, since I'm not a professional programmer, but what I vaguely saw and
understood is very, very exciting.  Unfortunately my scripts are too rough
to be shared, but I'm certain that real programmers could get unbelievable
results from such tons of data. I also found OCR confidence values for
every character and every word, so that uncertain words could be
highlighted when imported... or passed to a recaptcha tool. But using
abbyy xml would be a next step; what I'd like for now is simply a mapped
text layer from djvu files, made simple and useful for any wikisource
user.
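
To make the idea of highlighting uncertain words a bit more concrete, here
is a minimal Python sketch, not a finished tool. It rests on assumptions
that should be checked against a real abbyy.xml file: that each recognized
character sits in a charParams element, that the element carries
charConfidence and wordStart attributes, and that a higher charConfidence
means a more certain character. The function name uncertain_words, the
threshold of 50, and the rule "a word is as certain as its least certain
character" are illustrative choices only, and the sample data is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed layout; a real abbyy.xml is far larger and
# normally declares an XML namespace, which local_name() below ignores.
SAMPLE = """<document>
  <page><block><text><par><line><formatting>
    <charParams wordStart="true" charConfidence="98">o</charParams>
    <charParams wordStart="false" charConfidence="97">l</charParams>
    <charParams wordStart="false" charConfidence="30">d</charParams>
    <charParams wordStart="false" charConfidence="100"> </charParams>
    <charParams wordStart="true" charConfidence="99">t</charParams>
    <charParams wordStart="false" charConfidence="90">e</charParams>
    <charParams wordStart="false" charConfidence="95">x</charParams>
    <charParams wordStart="false" charConfidence="96">t</charParams>
  </formatting></line></par></text></block></page>
</document>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def uncertain_words(xml_text, threshold=50):
    """Return (word, confidence) pairs whose confidence is below threshold.

    Word confidence is taken as the minimum character confidence in the
    word (an assumption made for this sketch, not an ABBYY rule).
    """
    root = ET.fromstring(xml_text)
    words = []
    chars, confs = [], []

    def flush():
        # Close the word collected so far, if any.
        if chars:
            words.append(("".join(chars), min(confs)))
            chars.clear()
            confs.clear()

    for el in root.iter():
        name = local_name(el.tag)
        if name == "line":
            flush()  # a new line ends any word still open
            continue
        if name != "charParams":
            continue
        ch = el.text or ""
        if not ch.strip():
            flush()  # whitespace separates words
            continue
        if el.get("wordStart") == "true":
            flush()
        chars.append(ch)
        # A missing attribute is treated as fully uncertain (assumption).
        confs.append(int(el.get("charConfidence", "0")))
    flush()
    return [(w, c) for w, c in words if c < threshold]


if __name__ == "__main__":
    for word, conf in uncertain_words(SAMPLE):
        print(word, conf)
```

The returned words could then be marked up for highlighting during import,
or queued for a recaptcha-style check, as described above.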

Alex
_______________________________________________
Wikisource-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
