Great to see someone using Tesseract to preserve a little history!

The first thing you should do is start with something as close to the original as possible. Since you're working with this scan:
https://archive.org/details/filmdailyyearboo00film_4
that would be the zip containing the original JPEG2000 images:
https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_jp2.zip
Note that the Internet Archive runs all uploads through ABBYY FineReader, and the output from that is available here:
https://archive.org/download/filmdailyyearboo00film_4/filmdailyyearboo00film_4_abbyy.gz
Similar to Tesseract's hOCR output, it includes coordinates for all text blocks, so if it messed up the page segmentation, it should be possible to post-process it to reconstruct the correct flow. You can find an ABBYY parser that I wrote for another purpose here:
https://github.com/tfmorris/oed/blob/master/oedabby.py

If you want to run things through Tesseract to see whether it gives better quality (or just for the fun of it), you should be able to do that directly if your copy of Tesseract was built against a version of Leptonica with JPEG2000 support (mine was). I used this command to produce the attached output:

$ tesseract filmdailyyearboo00film_4_0742.jp2 pg738 hocr

Not surprisingly, Tesseract doesn't get the page segmentation correct. You could either preprocess to cut the image into four columns that you OCR separately, or post-process the hOCR output to put all the words in the correct order. When I manually crop to just the first column, I get pretty reasonable (to my eye) results. Files attached.

Tom

On Tuesday, March 29, 2016 at 2:29:27 AM UTC-4, [email protected] wrote:
> Hi All,
>
> I've been experimenting with tesseract and have been impressed with the accuracy of the software. I'm looking to use tesseract to process around 200 pages of material printed around 1934. I've attached a sample of the PDF I need to work with.
>
> I'm looking to improve the accuracy of the OCR process as much as possible. I believe that, with the vast (and I admit intimidating) list of options available, there are ways to improve the accuracy. Speed of recognition isn't as high a factor as accuracy for this project.
>
> The following steps are what I've found work best so far:
>
> 1. Convert the PDF to TIFF
>
>    convert -density 350 input.pdf -type Grayscale -background white +matte -depth 32 input.tif
>
> 2. Clean the TIFF file using the text cleaner script [1]
>
>    textcleaner -t 25 -s 1 -g input.tif cleaned.tif
>
> 3. OCR the cleaned TIFF file.
>
>    tesseract cleaned.tif ./test-ocr
>
> Any thoughts on ways to improve the accuracy will be gratefully received.
>
> With thanks.
>
> -Corey
>
> [1] http://www.fmwconcepts.com/imagemagick/textcleaner/
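To make the post-processing route from the reply concrete, here is a minimal sketch (not the code used to produce the attachments) that reads the word boxes out of hOCR and regroups them column by column. It assumes the hOCR marks words as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute, and that the page splits into equal-width columns. The names `HocrWords` and `words_by_column`, and the `line_tolerance` parameter, are hypothetical and exist only for this illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append(self._bbox + (text,))
            self._bbox = None


def words_by_column(hocr, page_width, n_columns=4, line_tolerance=10):
    """Return one list of text lines per column, read top to bottom.

    Assumption: columns are equal-width strips of the page. A real scan
    may need gutter positions measured from the image instead.
    """
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()

    col_width = page_width / n_columns
    columns = [[] for _ in range(n_columns)]
    for x0, y0, x1, _y1, text in parser.words:
        centre = (x0 + x1) / 2
        idx = min(int(centre // col_width), n_columns - 1)
        columns[idx].append((y0, x0, text))

    result = []
    for col in columns:
        col.sort()  # top to bottom
        lines = []  # each entry: [top_y_of_line, [(x0, text), ...]]
        for y0, x0, text in col:
            if lines and y0 - lines[-1][0] <= line_tolerance:
                lines[-1][1].append((x0, text))
            else:
                lines.append([y0, [(x0, text)]])
        # Within a line, order words left to right.
        result.append([" ".join(t for _, t in sorted(ws)) for _, ws in lines])
    return result


if __name__ == "__main__":
    # Hypothetical two-column fragment, in the interleaved order a
    # whole-page pass might emit it.
    sample = (
        "<span class='ocrx_word' title='bbox 10 10 80 30'>CAMDEN</span>"
        "<span class='ocrx_word' title='bbox 210 12 260 30'>LITTLE</span>"
        "<span class='ocrx_word' title='bbox 270 11 320 30'>ROCK</span>"
    )
    for column in words_by_column(sample, page_width=400, n_columns=2):
        print(column)
```

The fixed `line_tolerance` is a crude way to decide which words share a line; grouping by the hOCR `ocr_line` elements instead may work better when the line boxes themselves do not straddle columns.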
pg738.hocr
CALICO ROCK—659 Palace ............. 500 LITTLE ROCK—81.679 PIGGOTT—1,885 Gem ............. 20001 Royal -------------- 250 Arkansas ........ 1200c1 Franklin ........... 300 Capitol ........ 1200(P) CAMDEN—7.273 A FORDYCE—3,206;150 figgfigl’lla ............ €32 pINE BLUFF—20,750 Ilglialco ........... 50°C] mus“ """"""" Little Roxy: ......... 300 Alamo .' ............ 582 who .............. 250 New ...............325 Community ......... 600 FORREST CITY—4,594 Pulaski ________ 1000(1)) Saenger ........... 1580 CARLISLE—907 Imperial ............ 600 Prospect ........... mm Uptown ........... 250 Royal .......... 900(P) POCAHONTAS—1.896 FORT SMITH—31,429 Arcade ......... 240*Cl CLARENDON—2,149 Hgyt’s ............. 300 MAGNOLIA—2,989 Paramount ....... 250c1 Jme , --------------- 650 Macco ............. soo PRAIRIE GROVE—743 Little Rockefeller. . . .200 §4ystlc ------------ £33 Cozy ............ 300C] Reivvlto. . . . ........ 400 MALVERN—5,115 CLARKSVILLE—3,041 ' a 1 """""" :89 Liberty ............ 327 PRESCOTT—3,033 Temp e ..... 8 Dunlap ............ 400 Gem ............... 150 Little .............. 200 GLENWOOD SPRGS. MANILLA—l,226 CONWAY—5 534 Glenwood .......... 350 New ............... 275 P 1 RECTOR 250 ' a ace ............. CGOHWS'Y ............ e233 GURDON—2,172 MARIANNA—4,3l4 ran ------------- Wright’s ___________ 200 Imperial ........... 400 ROGERS—3,554 Victory ............ 400 State -------------- 400 Pastime .......... 3000 Star ............... 500 RUSSELVILLE—5,628 _ Community ........ 750 ParagiSeTTER 1'0“ 186 HARRISON_3,626 McGEHEE—3,4ss New ............... 500 """""" Lyric ..............498 Palace ------~--3°°'Cl COTTON PLANT— Ritz ............... 700 R ISEARcY—3’387 50 _ ia to .............. 0 Fox .....1’.6.8.9. ...... 400 E HARTFORD 1’21300 MENA—3'113 C merson """""" Lyric .............. 500 SILOAM SPGS.—2,378 ROSSETT—2,811 _ Rialto ............. 250 Crossett .......... sooc1 Cozy HAZE” 788 2°C MONETTE—l,111 .............. r . 
2 K ER—2, DANVILLE—761 I\ew .............. 2 5 JojMAC 0V 5:450 Pastime ............ 250 HEBER SPRGS-_1"1):° MONTICELLO—3,076 DARDANELLE—1,832 New ............... 0 Amusu ............. 400 CfnioeriNGDALE—Zflggo New ............. 300Cl HELENA—8,316 MORRILLTON—4,o43 STAMPS—2 705 DERMOTT—2,942 Parammmt --------- 233 Palace ............. 450 Br ,e ' 300 Allied .............. 300 Plaza .............. Rialto .......... 450*01 own _ ............ STAR CITY—932 DE WITT—l.853 HOPE—6’0“ NEWARK—897 Central 1500 New ............... 400 gig?” ----------- 5093301 Royal .............. 27s """"" """""" STUTTGART—4.927 DeQUEEN—2.893 HOT SPRINGS_20 238 NEWPORT—4,547 Majestic ........... 750 Grand ............. 500 Best ’642 Capitol ,,,,,,,,,,,, 500 Riccland ......... 400'Cl DES AR _1, Central ............ oUU _ Dixie ______ (F _ . 3.82%0C1 Princess ............ 938 NORFOLK—247 TEXARFANA 10'7“] Royal ............. 700 New Lyric ....... 200Cl Little Princess ' ' ‘300C DIERKS—l,544 Spa _______________ 250 Paramount ........ 1900 Dixie ............. 300 NORPHLET—l,063 Strand ............. 700 T _ HXOXIE—li448285C1 Strand ............. 380 TRUMAN—995 _ rlang e ......... Gem DUMAS 1,669250Cl NORTH LITTLE Grand ........... 250Cl HUGHES—815 . ROCK—19.418 VAN BUREN—5,182 EARLE Star ............... 300 Princess ........ lOOOCl N R l 50 Princess 350Cl Rlalto .............. ()UU ew oya """"" 0 """"" HUNTINGTON—813 WALDRON 1077 EL DORADO—16.421 Majestice .......... 300 .NASHVILLE—ZA69 P' ' 35 Majestic ........... 700 HUNTSVILLE—1 000 Liberty ' ' ' ' ' ’ ’ ‘ ' ' ' ”380 mes ............ 0 Rialto ............ llUU ’ Star ............... soo Dixie .............. 150 OSCEOLA—z'm WALNuzTomRmGL Gem ............... 350 Shafum ' _ _ _ . 250 ENGLAND—2.130 JONESBORO—10.326 """" Best _______________ 400 Liberty ............ 400 OZARK—l-564 WARREN—2.523 Strand ............ mm: Ozark ........... 300C] P t' ............ EUDORA—2,020 Palace ......... 
300'Cl as mm 500 Crystal 400 PARAGOULD—5956 WEST HELENA—4.489 """""""" JUNCTION CITY—814 Capitol ...........,.700 Gem ............200'Cl EUREKA SPRINGS— Palace ........... ZSO’Cl MaJCStiC -------- 300'C1 Palace ............. 387 2.276 Commodore . 200 LEACHVILLE—l.157 PARIS—3.234 WILMOT—777 """" New ............250Cl Strand .............400 Strand .............200 FAYETTEVILLE— 7.394 LINCOLN—687 PARKIN—1.676 WYNNE—3.505 Ozark ............. 300 Cozy .............. 534 Princess .......... 250C] Imperial ........... 300 738
CALICO ROCK—659
Gem ............. 200Cl
CAMDEN—7.273
Malco ........... SOOCl
Rialto .............. 250
CARLISLE—907
Uptown ........... 250
Paramount ....... 250Cl
Little Rockefeller. . . .200
CLARKSVILLE—3,041
Dunlap ............ 400
Little .............. 200
CONWAY—5,534
Conway ............ 500
Grand ............. 600
CORNING—L550
State .............. 400
COTTER—l,064
Paradise ........... 186
COTTON PLANT— 1,689
Fox ............... 400
CROSSETT—2,811
Crossett .......... 500C]
DANVILLE—761
Pastime ............ 250
DARDANELLE—1,832
New ............. 300Cl
DERMOTT—2,942
Allied .............. 300
DE WEST—1,853
New ............... 400
DeQUEEN—2.893
Grand ............. 500
DES ARC—1,286
Dixie ............ 200Cl
DIERKS—l,544
Dixie .............. 300
DUMAS—l,669
Gem ............. 250Cl
EARLE
Princess ......... 350Cl
EL DORADO—16.421
Majestic Rialto .. Star ...............
ENGLAND—2,130
Best ............... 400
EUDORA—2,020
Crystal ............. 400
EUREKA SPRINGS— 2.276
Commodore ........ 200
FAYETTEVILLE— 7.394
Ozark ............. 300
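The first-column text above came from a manual crop. For the other route mentioned in the reply, cutting the page into four columns before OCR, a rough sketch of computing the crop rectangles follows. It assumes equal-width columns, which the real page may not have, and the helper name `column_boxes` and its `overlap` parameter are hypothetical. The boxes use (left, upper, right, lower) order, the same order Pillow's `Image.crop` accepts; whether Pillow can open the `.jp2` files directly depends on it being built with OpenJPEG support.

```python
def column_boxes(width, height, n_columns=4, overlap=0):
    """Return one (left, upper, right, lower) crop box per column.

    Assumption: the columns are equal-width vertical strips. `overlap`
    widens each strip by that many pixels on both sides (clamped to the
    page) so text near a gutter is not cut in half.
    """
    boxes = []
    for i in range(n_columns):
        left = max(0, round(i * width / n_columns) - overlap)
        right = min(width, round((i + 1) * width / n_columns) + overlap)
        boxes.append((left, 0, right, height))
    return boxes


if __name__ == "__main__":
    # Hypothetical page size; read the real one from the image.
    for box in column_boxes(2000, 3000, overlap=10):
        print(box)
```

Each cropped strip can then be saved and passed to `tesseract` on its own, so the page segmentation step only ever sees a single column.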

