You could try to programmatically match up each hOCR text block to a
corresponding fragment from the transcripts, based on textual similarity
(then replace the hOCR text with the "real" text). There's monotonicity in
terms of XY coordinates vs offset in the transcript, i.e. (X1,Y1) < (X2,Y2)
=> text
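A rough sketch of that matching idea, in case it helps. It assumes you have already pulled the OCR words out of the hOCR in reading order and split the transcript into words; the function name and the normalisation rule are made up for illustration. Python's difflib.SequenceMatcher only ever pairs items in order, so it respects the monotonicity mentioned above.

```python
from difflib import SequenceMatcher


def normalise(word):
    # Compare words case-insensitively and without surrounding punctuation.
    return word.lower().strip(".,;:!?\"'()")


def align_ocr_to_transcript(ocr_words, transcript_words):
    """Return one corrected word per OCR word, keeping the OCR word order.

    Hypothetical helper: words are swapped for transcript words only where
    the alignment pairs them one-to-one; anything else is left as OCR'd.
    """
    a = [normalise(w) for w in ocr_words]
    b = [normalise(w) for w in transcript_words]
    corrected = list(ocr_words)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # One-to-one run: take the transcript's spelling.
            corrected[i1:i2] = transcript_words[j1:j2]
        # Insertions, deletions and uneven replacements need a human look.
    return corrected
```

Because the result has exactly one entry per OCR word, each corrected word can be written back into the hOCR element it came from, keeping that element's coordinates.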
But Raffaele, how do you generate the hOCR in the first place if you're
using human-generated transcripts and not OCR? Hand coding each page would
take forever.
On Fri, Jan 17, 2014 at 3:24 AM, raffaele messuti <
raffaele.mess...@gmail.com> wrote:
> Padraic Stack wrote:
> > What is a straightfo
Padraic Stack wrote:
> What is a straightforward way to combine the text with overlaid images
> to create searchable pdfs?
If you have the transcription in hOCR[1] format, the tool you need is
hocr2pdf[2].
I never tried it for PDFs; years ago I made some DjVu files following this
tutorial[3]
[1] http://en.wi
I don't think I can answer your question, but we have a similar problem.
I'm not sure about all OCR programs, but the version of Tesseract I've seen
in Islandora creates two files, one is the .txt file you would expect and
the other is an hOCR file with very interesting markup linking words in
t
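For anyone who wants to look inside such a file, here is a minimal sketch for pulling words and their boxes out of hOCR with the Python standard library. It assumes the usual hOCR convention where each word is a span with class "ocrx_word" and a title attribute such as "bbox 36 92 96 116; x_wconf 95"; the class and function names are invented for the example, and other OCR engines may mark things up differently.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    # The title holds ';'-separated properties; find "bbox x0 y0 x1 y1".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain further nested spans.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words
```

Calling extract_words() on the contents of an hOCR file gives a list of (word, (x0, y0, x1, y1)) tuples in document order, which is the raw material for matching against a transcript.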
Hi folks,
I have a number of typescript / manuscript images on which it is quite
time-consuming to run OCR. (Or, more accurately, it is quite
time-consuming to correct the OCR.)
For some of these I have text files containing accurate transcriptions.
In other cases I have TEI files with these t