Please see the tesseract-ocr/tesstrain repo, specifically the wiki pages on training for Fraktur.
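On finding the "Spectacles" listings in the quoted message below: once you have plain text back from OCR, a small post-processing step may already get you part of the way. This is only a rough sketch under my own assumptions, not a tested workflow. The function names are made up for illustration, and I am assuming that a listing runs from the "Spectacles" heading to the next blank line, which may not match your pages. It works on plain text only, so it cannot tell you which words were italicized.

```python
import re
from typing import List

LONG_S = "\u017f"  # the long s character found in 18th-century type


def normalize_long_s(text: str) -> str:
    """Replace every long s with a plain 's'."""
    return text.replace(LONG_S, "s")


def spectacles_blocks(ocr_text: str) -> List[str]:
    """Return the text that follows each 'Spectacles' heading.

    Assumption (hypothetical): a listing ends at the next blank line.
    """
    text = normalize_long_s(ocr_text)
    blocks: List[str] = []
    current = None  # None means we are not inside a listing
    for line in text.splitlines():
        if current is None:
            match = re.search(r"\bspectacles\b", line, flags=re.IGNORECASE)
            if match:
                # Keep whatever follows the heading word on the same line.
                rest = line[match.end():].strip(" .:")
                current = [rest] if rest else []
        elif line.strip():
            current.append(line.strip())
        else:
            # A blank line closes the current listing.
            blocks.append(" ".join(current))
            current = None
    if current is not None:
        blocks.append(" ".join(current))
    return blocks


# Invented sample text, not taken from the newspaper.
sample = (
    "Avis divers.\n"
    "\n"
    "SPECTACLES.\n"
    "Le Barbier de Séville, comédie.\n"
    "La Fau\u017f\u017fe Magie, opéra.\n"
    "\n"
    "Prix des grains.\n"
)
print(spectacles_blocks(sample))
```

You would feed it the string returned by your OCR step, one page at a time, and collect the results per file.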

On Fri, Dec 27, 2019, 00:51 Scott M. Sanders <[email protected]> wrote:

> If you can't see the bad_rep.html, here is a PDF version.
>
> On Thursday, December 26, 2019 at 14:17:46 UTC-5, Scott M. Sanders wrote:
>>
>> I'm trying to OCR over 2000 PDF copies of Bordeaux's 18th-century
>> newspaper. My goal is to recreate the Bordeaux theater repertoire from 1784
>> to 1790. This should be easy if I can identify the word "Spectacles" and
>> then find any words that are italicized after Spectacles. These words are
>> either the name of a theatrical work or the name of an artist.
>>
>> I've set up a workflow in Jupyter Notebook that has begun the process.
>> I've attached a copy of the PDF (bp2.pdf) and a copy of my code and output
>> (bord_prj.html).
>>
>> Here are my trouble spots. I would appreciate any suggestions on the
>> following questions.
>>
>> 1. 18th-century French spelling and type
>> I was wondering if there are any better training sets for 18th-century
>> French that deal with the long s and with 18th-century spellings (i.e.
>> the ois, oit, oient verb endings).
>>
>> 2. Retaining formatting
>> I'm using pytesseract to OCR a JPEG of the PDF. I haven't found how to
>> retain style formatting in my OCR text.
>>
>> 3. Missing steps in my workflow
>> I'm currently using a binarization function to make the OCR work better.
>> To improve the results, I'll also need to put the columns of text into
>> bounding boxes.
>>
>> 4. Processing multiple files
>> Once I've figured out the first steps, I'll need to set up a workflow
>> that allows me to process multiple PDFs.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVPF5PPUySLFb2LeFM_4XqAM-sicxP8D99AAWo3d-6O5Q%40mail.gmail.com.

