Great to see someone using Tesseract to preserve a little history! 

The first thing you should do is start with something as close to the 
original as possible.  Since you're working with this 
scan: https://archive.org/details/filmdailyyearboo00film_4
that would be the zip containing the original JPEG2000 
images: 
https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip

Note that the Internet Archive runs all uploads through ABBYY FineReader, and 
the output from that is available 
here: 
https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
Similar to Tesseract's hOCR output, it includes coordinates for all text 
blocks, so if it messed up the page segmentation it should be possible to 
post-process it to reconstruct the correct flow.  You can find an ABBYY parser 
that I wrote for another purpose 
here: https://github.com/tfmorris/oed/blob/master/oedabby.py

If you want to run the pages through Tesseract to see whether it gives 
better quality (or just for the fun of it), you should be able to do that 
directly if your copy of Tesseract was built against a version of Leptonica 
with JPEG2000 support (mine was). I used this command to produce the 
attached output:

$ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr

Not surprisingly, Tesseract doesn't get the page segmentation correct.  You 
could either preprocess to cut the image into four columns that you OCR 
separately or post-process the hOCR output to put all the words in the 
correct order.
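
For the post-processing route, here is a rough sketch of what I mean, not 
something I have run against this book. It assumes the hOCR marks each word 
as a span with class ocrx_word and a "bbox x0 y0 x1 y1" entry in its title 
attribute, and that the page splits into equal-width columns. The names 
HocrWords and reorder_by_column, the page_width argument, and the 
line_tolerance value are all made up for illustration; real pages will 
probably need measured column boundaries instead of an even split.

```python
import re
from html.parser import HTMLParser

# Matches the bounding box stored in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (left, top, right, bottom, text) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append(self._bbox + (text,))
            self._bbox = None


def reorder_by_column(hocr, page_width, n_columns=4, line_tolerance=10):
    """Return text lines in column-by-column reading order.

    Assumption: columns are equal-width slices of the page, and words whose
    tops lie within line_tolerance pixels of the first word on a line belong
    to that line.
    """
    parser = HocrWords()
    parser.feed(hocr)

    col_width = page_width / n_columns
    columns = [[] for _ in range(n_columns)]
    for left, top, right, _bottom, text in parser.words:
        centre = (left + right) / 2
        index = min(int(centre // col_width), n_columns - 1)
        columns[index].append((top, left, text))

    lines = []
    for column in columns:
        column.sort()                      # top to bottom, then left to right
        current, current_top = [], None
        for top, left, text in column:
            if current_top is not None and top - current_top > line_tolerance:
                lines.append(" ".join(t for _, t in sorted(current)))
                current, current_top = [], None
            if current_top is None:
                current_top = top
            current.append((left, text))
        if current:
            lines.append(" ".join(t for _, t in sorted(current)))
    return lines
```

You would call it with the contents of the .hocr file and the width of the 
page image in pixels, e.g. reorder_by_column(open("pg738.hocr").read(), 
page_width=W) where W is whatever your image measures.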

When I manually crop to just the first column, I get pretty reasonable (to 
my eye) results. Files attached.

Tom


On Tuesday, March 29, 2016 at 2:29:27 AM UTC-4, [email protected] 
wrote:
>
> Hi All,
>
> I've been experimenting with tesseract and have been impressed with the 
> accuracy of the software. I'm looking to use tesseract to process around 
> 200 pages of printed material that was printed in around 1934. I've 
> attached a sample of the PDF I need to work with. 
>
> I'm looking to improve the accuracy of the OCR process as much as 
> possible. I believe that with the vast, and I admit intimidating, list of 
> options available that there are ways to improve the accuracy. Speed of 
> recognition isn't as high a factor as accuracy for this project. 
>
> The following steps are what I've found work best so far:
>
> 1. Convert the PDF to TIFF
>
> convert -density 350 input.pdf -type Grayscale -background white +matte 
> -depth 32 input.tif
>
>
> 2. Clean the TIFF file using the text cleaner script [1]
>
> textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>
>
> 3. OCR the cleaned TIFF file.
>
> tesseract cleaned.tif ./test-ocr
>
>
> Any thoughts on ways to improve the accuracy will be gratefully received. 
>
>
> With thanks. 
>
>
> -Corey
>
>
> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
>


Attachment: pg738.hocr

CALICO ROCK—659 Palace ............. 500 LITTLE ROCK—81.679 PIGGOTT—1,885

 
  

Gem ............. 20001 Royal -------------- 250 Arkansas ........ 1200c1 
Franklin ........... 300
Capitol ........ 1200(P)
CAMDEN—7.273 A FORDYCE—3,206;150 figgfigl’lla ............ €32 pINE BLUFF—20,750
Ilglialco ........... 50°C] mus“ """"""" Little Roxy: ......... 300 Alamo .' 
............ 582
who .............. 250 New ...............325 Community ......... 600
FORREST CITY—4,594 Pulaski ________ 1000(1)) Saenger ........... 1580
CARLISLE—907 Imperial ............ 600 Prospect ........... mm
Uptown ........... 250 Royal .......... 900(P) POCAHONTAS—1.896
FORT SMITH—31,429 Arcade ......... 240*Cl
CLARENDON—2,149 Hgyt’s ............. 300 MAGNOLIA—2,989
Paramount ....... 250c1 Jme , --------------- 650 Macco ............. soo 
PRAIRIE GROVE—743
Little Rockefeller. . . .200 §4ystlc ------------ £33 Cozy ............ 300C]
Reivvlto. . . . ........ 400 MALVERN—5,115
CLARKSVILLE—3,041 ' a 1 """""" :89 Liberty ............ 327 PRESCOTT—3,033
Temp e ..... 8
Dunlap ............ 400 Gem ............... 150
Little .............. 200 GLENWOOD SPRGS. MANILLA—l,226
CONWAY—5 534 Glenwood .......... 350 New ............... 275 P 1 RECTOR 250
' a ace .............
CGOHWS'Y ............ e233 GURDON—2,172 MARIANNA—4,3l4
ran ------------- Wright’s ___________ 200 Imperial ........... 400 ROGERS—3,554
Victory ............ 400
State -------------- 400 Pastime .......... 3000 Star ............... 500 
RUSSELVILLE—5,628
_ Community ........ 750
ParagiSeTTER 1'0“ 186 HARRISON_3,626 McGEHEE—3,4ss New ............... 500
"""""" Lyric ..............498 Palace ------~--3°°'Cl
COTTON PLANT— Ritz ............... 700 R ISEARcY—3’387 50
_ ia to .............. 0
Fox .....1’.6.8.9. ...... 400 E HARTFORD 1’21300 MENA—3'113
C merson """""" Lyric .............. 500 SILOAM SPGS.—2,378
ROSSETT—2,811 _ Rialto ............. 250
Crossett .......... sooc1 Cozy HAZE” 788 2°C MONETTE—l,111
.............. r . 2 K ER—2,
DANVILLE—761 I\ew .............. 2 5 JojMAC 0V 5:450
Pastime ............ 250 HEBER SPRGS-_1"1):° MONTICELLO—3,076
DARDANELLE—1,832 New ............... 0 Amusu ............. 400 
CfnioeriNGDALE—Zflggo
New ............. 300Cl HELENA—8,316 MORRILLTON—4,o43 STAMPS—2 705
DERMOTT—2,942 Parammmt --------- 233 Palace ............. 450 Br ,e ' 300
Allied .............. 300 Plaza .............. Rialto .......... 450*01 own _ 
............
STAR CITY—932
DE WITT—l.853 HOPE—6’0“ NEWARK—897 Central 1500
New ............... 400 gig?” ----------- 5093301 Royal .............. 27s """""
"""""" STUTTGART—4.927
DeQUEEN—2.893 HOT SPRINGS_20 238 NEWPORT—4,547 Majestic ........... 750
Grand ............. 500 Best ’642 Capitol ,,,,,,,,,,,, 500 Riccland ......... 
400'Cl
DES AR _1, Central ............ oUU _
Dixie ______ (F _ . 3.82%0C1 Princess ............ 938 NORFOLK—247 TEXARFANA 
10'7“]
Royal ............. 700 New Lyric ....... 200Cl Little Princess ' ' ‘300C
DIERKS—l,544 Spa _______________ 250 Paramount ........ 1900
Dixie ............. 300 NORPHLET—l,063 Strand ............. 700
T _ HXOXIE—li448285C1 Strand ............. 380 TRUMAN—995
_ rlang e .........
Gem DUMAS 1,669250Cl NORTH LITTLE Grand ........... 250Cl
HUGHES—815 . ROCK—19.418 VAN BUREN—5,182
EARLE Star ............... 300 Princess ........ lOOOCl N R l 50
Princess 350Cl Rlalto .............. ()UU ew oya """"" 0
""""" HUNTINGTON—813 WALDRON 1077
EL DORADO—16.421 Majestice .......... 300 .NASHVILLE—ZA69 P' ' 35
Majestic ........... 700 HUNTSVILLE—1 000 Liberty ' ' ' ' ' ’ ’ ‘ ' ' ' ”380 
mes ............ 0
Rialto ............ llUU ’
Star ............... soo Dixie .............. 150 OSCEOLA—z'm WALNuzTomRmGL
Gem ............... 350 Shafum ' _ _ _ . 250
ENGLAND—2.130 JONESBORO—10.326 """"
Best _______________ 400 Liberty ............ 400 OZARK—l-564 WARREN—2.523
Strand ............ mm: Ozark ........... 300C] P t' ............
EUDORA—2,020 Palace ......... 300'Cl as mm 500
Crystal 400 PARAGOULD—5956 WEST HELENA—4.489
"""""""" JUNCTION CITY—814 Capitol ...........,.700 Gem ............200'Cl
EUREKA SPRINGS— Palace ........... ZSO’Cl MaJCStiC -------- 300'C1 Palace 
............. 387
2.276
Commodore . 200 LEACHVILLE—l.157 PARIS—3.234 WILMOT—777
"""" New ............250Cl Strand .............400 Strand .............200
FAYETTEVILLE—
7.394 LINCOLN—687 PARKIN—1.676 WYNNE—3.505
Ozark ............. 300 Cozy .............. 534 Princess .......... 250C] 
Imperial ........... 300

738

 

CALICO ROCK—659

Gem ............. 200Cl
CAMDEN—7.273
Malco ........... SOOCl
Rialto .............. 250
CARLISLE—907
Uptown ........... 250

Paramount ....... 250Cl
Little Rockefeller. . . .200

 

CLARKSVILLE—3,041
Dunlap ............ 400
Little .............. 200
CONWAY—5,534
Conway ............ 500
Grand ............. 600
CORNING—L550
State .............. 400
COTTER—l,064
Paradise ........... 186
COTTON PLANT—
1,689
Fox ............... 400
CROSSETT—2,811
Crossett .......... 500C]
DANVILLE—761
Pastime ............ 250
DARDANELLE—1,832
New ............. 300Cl
DERMOTT—2,942
Allied .............. 300
DE WEST—1,853
New ............... 400
DeQUEEN—2.893
Grand ............. 500
DES ARC—1,286
Dixie ............ 200Cl
DIERKS—l,544
Dixie .............. 300
DUMAS—l,669
Gem ............. 250Cl
EARLE
Princess ......... 350Cl
EL DORADO—16.421
Majestic
Rialto ..
Star ...............
ENGLAND—2,130
Best ............... 400
EUDORA—2,020
Crystal ............. 400
EUREKA SPRINGS—
2.276
Commodore ........ 200
FAYETTEVILLE—
7.394

Ozark ............. 300
