Thanks a lot for your rapid answer! Do I have to change the psm to 7 in the Makefile, or doesn't this matter too much?
On Thursday, 29 August 2019 10:02:08 UTC+2, shree wrote:
>
> Use https://github.com/OCR-D/ocrd-train since you have line images and
> transcription.
>
> On Thu, Aug 29, 2019 at 1:13 PM Phillip Ströbel <[email protected]> wrote:
>
>> dear tesseract community
>>
>> atm, i'm trying to compare the performance of different ocr engines, one
>> of which is tesseract.
>> i have a ground truth already, which comes in page-xml files, where lines
>> look as follows:
>>
>> <TextLine id="tl_7" primaryLanguage="German" custom="readingOrder {index:0;}">
>>   <Coords points="1281,594 1734,594 1734,657 1281,657"/>
>>   <Baseline points="1282,642 1734,645"/>
>>   <Word id="w_w1aab1b9b2b3b1ab1" language="German" custom="readingOrder {index:0;}">
>>     <Coords points="1281,594 1360,594 1360,643 1281,643"/>
>>     <TextEquiv>
>>       <Unicode>den</Unicode>
>>     </TextEquiv>
>>     <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
>>   </Word>
>>   <Word id="w_w1aab1b9b2b3b1b1b1" language="German" custom="readingOrder {index:1;}">
>>     <Coords points="1391,597 1452,597 1452,657 1391,657"/>
>>     <TextEquiv>
>>       <Unicode>19.</Unicode>
>>     </TextEquiv>
>>     <TextStyle fontFamily="Times New Roman" fontSize="15.0"/>
>>   </Word>
>>   <Word id="w_w1aab1b9b2b3b1b2b1" language="German" custom="readingOrder {index:2;}">
>>     <Coords points="1467,597 1734,597 1734,657 1467,657"/>
>>     <TextEquiv>
>>       <Unicode>Heumonat.</Unicode>
>>     </TextEquiv>
>>     <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
>>   </Word>
>>   <TextEquiv>
>>     <Unicode>den 19. Heumonat.</Unicode>
>>   </TextEquiv>
>> </TextLine>
>>
>> i tried to follow the tesseract tutorial to train a model from scratch.
>> since i already have the coordinates of the line boxes, i created the .box
>> file from the points attribute in the TextLine Coords (line-based, so it
>> was smth like "WordStr <left> <bottom> <right> <top> 0 #text \n <left>
>> <bottom> <right> <top>"). when i try to produce the lstmf files, however, i
>> get many warnings that there is no box overlapping the text line.
>> i segmented the data already for another ocr system, which expects line
>> images and one text file with the corresponding transcription. i found that
>> the --psm option would allow for taking lines as an input.
>>
>> since from the tutorial not everything is clear to me, i would like to
>> ask the following questions:
>>
>> 1. if i have line tiffs and the corresponding text, say line0001.tif and
>>    the text "den 19. Heumonat" in line0001.txt, how do i
>>    1. produce the unicharset file --> i used the unicharset_extractor
>>       and let it run over all .box files i had and this worked, but is it
>>       the right way?
>>    2. what do the .box files need to look like? what coordinates do i
>>       need to use? after all, it is the whole image
>>    3. what data do i use to produce the required files? do i download
>>       the *.traineddata from tesseract? do i produce this myself? does
>>       tesseract need a text file with all the text from the training data,
>>       wordlist from training data, and so on?
>>
>> i'm sorry, i know there is documentation but i find it very confusing.
>> thanks in advance for any helpful hints.
>>
>> best,
>>
>> phillip
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/ecdc6d31-f505-443f-8058-99c6f6670427%40googlegroups.com
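For reference, here is a rough sketch of how the box line described in the quoted message could be derived from a PAGE-XML `points` string. It assumes two things that are not stated in the thread: that the PAGE coordinates use a top-left origin (y grows downward), and that the page image is 2000 pixels high, which is a made-up value for illustration. Tesseract box files count y from the bottom of the image, so the y values have to be flipped with the image height. Whether a missing flip is behind the "no box overlapping" warnings is only a guess. The function names `points_to_bbox` and `wordstr_line` are hypothetical helpers, not part of Tesseract or ocrd-train.

```python
from typing import Tuple

# (left, top, right, bottom) in image coordinates with a top-left origin
BBox = Tuple[int, int, int, int]


def points_to_bbox(points: str) -> BBox:
    """Return the bounding box of a PAGE-XML points string like '1281,594 1734,594 ...'."""
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


def wordstr_line(points: str, text: str, image_height: int, page: int = 0) -> str:
    """Build a line in the 'WordStr <left> <bottom> <right> <top> <page> #text' shape.

    image_height is needed because box files measure y from the bottom edge,
    while the PAGE coordinates are assumed here to measure y from the top edge.
    """
    left, top, right, bottom = points_to_bbox(points)
    box_bottom = image_height - bottom  # lower edge, counted from the bottom
    box_top = image_height - top        # upper edge, counted from the bottom
    return f"WordStr {left} {box_bottom} {right} {box_top} {page} #{text}"


if __name__ == "__main__":
    # Coordinates and text taken from the TextLine quoted above;
    # image_height=2000 is an assumed value, not taken from the thread.
    print(wordstr_line("1281,594 1734,594 1734,657 1281,657",
                       "den 19. Heumonat.", image_height=2000))
```

With pre-cut line images the box would instead span the whole image (0 0 width height), so no PAGE coordinates are needed in that case.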

