Re: [CODE4LIB] Creating pdfs from images and their text

2014-01-18 Thread Dan Muresan
You could try to programmatically match up each hOCR text block to a corresponding fragment from the transcripts, based on textual similarity (then replace the hOCR text with the "real" text). There's monotonicity in terms of XY coordinates vs offset in the transcript, i.e. (X1,Y1) < (X2,Y2) => text
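
A rough sketch of that matching idea, in Python. It assumes the OCR blocks are already in reading order and that the transcript covers the same text with nothing skipped. The function name `align_blocks`, the `slack` parameter, and the use of `difflib` as the similarity measure are illustrative choices, not something from this thread. Thanks to the monotonicity, a single cursor moving forward through the transcript is enough:

```python
from difflib import SequenceMatcher


def align_blocks(ocr_blocks, transcript, slack=3):
    """Greedy monotone alignment (hypothetical helper).

    For each OCR block, in reading order, pick the run of transcript
    words starting at the cursor that looks most like the OCR text,
    then advance the cursor past that run.
    """
    words = transcript.split()
    cursor = 0
    aligned = []
    for block in ocr_blocks:
        n = len(block.split())
        best_ratio, best_end = -1.0, cursor
        # OCR may split or merge words, so try a few span lengths around n.
        for length in range(max(1, n - slack), n + slack + 1):
            end = min(cursor + length, len(words))
            candidate = " ".join(words[cursor:end])
            ratio = SequenceMatcher(None, block.lower(), candidate.lower()).ratio()
            if ratio > best_ratio:
                best_ratio, best_end = ratio, end
        aligned.append(" ".join(words[cursor:best_end]))
        cursor = best_end
    return aligned


ocr = ["Tbe quick hrown fox", "jumps ovcr the", "lazy d0g"]
real = "The quick brown fox jumps over the lazy dog"
print(align_blocks(ocr, real))
```

Being greedy, one bad match shifts everything after it, so for real pages you would probably want to check the similarity score of each block and flag low ones for a human to look at.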

Re: [CODE4LIB] Creating pdfs from images and their text

2014-01-17 Thread Daron Dierkes
But Raffaele, how do you generate the hOCR in the first place if you're using human-generated transcripts and not OCR? Hand coding each page would take forever. On Fri, Jan 17, 2014 at 3:24 AM, raffaele messuti < raffaele.mess...@gmail.com> wrote: > Padraic Stack wrote: > > What is a straightfo

Re: [CODE4LIB] Creating pdfs from images and their text

2014-01-17 Thread raffaele messuti
Padraic Stack wrote: > What is a straightforward way to combine the text with overlaid images > to create searchable pdfs? If you have the transcription in hOCR[1] format, the tool you need is hocr2pdf[2]. I never tried it for PDFs; years ago I made some DjVu files following this tutorial[3]. [1] http://en.wi

Re: [CODE4LIB] Creating pdfs from images and their text

2014-01-16 Thread Daron Dierkes
I don't think I can answer your question, but we have a similar problem. I'm not sure about all OCR programs, but the version of Tesseract I've seen in Islandora creates two files: one is the .txt file you would expect, and the other is an hOCR file with very interesting markup linking words in t
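
To make the hOCR markup a bit more concrete, here is a small Python sketch that pulls words and their bounding boxes out of an hOCR fragment. It assumes the usual hOCR convention of one `span` with class `ocrx_word` per word and a `bbox x0 y0 x1 y1` entry in the `title` attribute; the sample markup is made up for illustration, and the names `HocrWords` and `extract_words` are hypothetical. It does not handle spans nested inside a word.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = None  # list of text pieces while inside a word span

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = BBOX_RE.search(a.get("title") or "")
            self._bbox = tuple(int(g) for g in m.groups()) if m else None
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Close the current word on its </span>; inline tags like </strong> are ignored.
        if tag == "span" and self._text is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._text = None
            self._bbox = None


def extract_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words


# Made-up sample in the style of Tesseract hOCR output.
sample = (
    '<span class="ocr_line" title="bbox 10 10 200 40">'
    '<span class="ocrx_word" title="bbox 10 10 90 40; x_wconf 93">Hello</span> '
    '<span class="ocrx_word" title="bbox 100 10 200 40; x_wconf 88">world</span>'
    "</span>"
)
for text, bbox in extract_words(sample):
    print(text, bbox)
```

With the coordinates in hand, the OCR text of each word or line can be swapped for corrected text while the position on the page image is kept.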

[CODE4LIB] Creating pdfs from images and their text

2014-01-16 Thread Padraic Stack
Hi folks, I have a number of typescript / manuscript images on which it is quite time consuming to run OCR. (Or more accurately it is quite time consuming to correct the OCR). For some of these I have text files containing accurate transcriptions. In other cases I have TEI files with these t