Since you already have line images and transcriptions, use
https://github.com/OCR-D/ocrd-train.
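
As far as I recall from the ocrd-train README, it expects each line image as a .tif in data/ground-truth, with the transcription next to it in a file of the same name ending in .gt.txt. A minimal sketch for staging your existing lineNNNN.tif / lineNNNN.txt pairs that way follows; the function name stage_ground_truth and the directory names are hypothetical, so please check the layout against the README of the version you use.

```python
import shutil
from pathlib import Path


def stage_ground_truth(src_dir, dst_dir):
    """Copy image/text pairs into dst_dir using the .tif / .gt.txt naming.

    Assumption: src_dir holds files such as line0001.tif and line0001.txt.
    Returns the list of base names that were staged.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    staged = []
    for image in sorted(src.glob("*.tif")):
        text_file = image.with_suffix(".txt")
        if not text_file.exists():
            continue  # skip images that have no transcription
        # Collapse whitespace so the transcription is a single line.
        line = " ".join(text_file.read_text(encoding="utf-8").split())
        shutil.copy(image, dst / image.name)
        (dst / (image.stem + ".gt.txt")).write_text(line + "\n", encoding="utf-8")
        staged.append(image.stem)
    return staged
```

Something like stage_ground_truth("lines", "data/ground-truth") would then be followed by the Makefile's training target (for example "make training MODEL_NAME=mymodel", where mymodel is a placeholder). The Makefile should then take care of the box, lstmf and unicharset files, so you would not need to write them by hand.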

On Thu, Aug 29, 2019 at 1:13 PM Phillip Ströbel <[email protected]>
wrote:

> dear tesseract community
>
> atm, i'm trying to compare the performance of different ocr engines, one
> of which is tesseract.
> i have a ground truth already, which comes in page-xml files, where lines
> look as follows:
>
> <TextLine id="tl_7" primaryLanguage="German" custom="readingOrder
> {index:0;}">
> <Coords points="1281,594 1734,594 1734,657 1281,657"/>
> <Baseline points="1282,642 1734,645"/>
> <Word id="w_w1aab1b9b2b3b1ab1" language="German" custom="readingOrder
> {index:0;}">
> <Coords points="1281,594 1360,594 1360,643 1281,643"/>
> <TextEquiv>
> <Unicode>den</Unicode>
> </TextEquiv>
> <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
> </Word>
> <Word id="w_w1aab1b9b2b3b1b1b1" language="German" custom="readingOrder
> {index:1;}">
> <Coords points="1391,597 1452,597 1452,657 1391,657"/>
> <TextEquiv>
> <Unicode>19.</Unicode>
> </TextEquiv>
> <TextStyle fontFamily="Times New Roman" fontSize="15.0"/>
> </Word>
> <Word id="w_w1aab1b9b2b3b1b2b1" language="German" custom="readingOrder
> {index:2;}">
> <Coords points="1467,597 1734,597 1734,657 1467,657"/>
> <TextEquiv>
> <Unicode>Heumonat.</Unicode>
> </TextEquiv>
> <TextStyle fontFamily="Times New Roman" fontSize="16.0"/>
> </Word>
> <TextEquiv>
> <Unicode>den 19. Heumonat.</Unicode>
> </TextEquiv>
> </TextLine>
>
> i tried to follow the tesseract tutorial to train a model from scratch.
> since i already have the coordinates of the line boxes, i created the .box
> file from the points attribute in the TextLine Coords (line-based, so it
> was something like "WordStr <left> <bottom> <right> <top> 0 #text \n <left>
> <bottom> <right> <top>"). when i try to produce the lstmf files, however, i
> get many warnings that there is no box overlapping the text line.
> i segmented the data already for another ocr system, which expects line
> images and one text file with the corresponding transcription. i found that
> the --psm option would allow for taking lines as an input.
>
> since from the tutorial not everything is clear to me, i would like to ask
> the following questions:
>
>
>    1. if i have line tiffs and the corresponding text, say
>    line0001.tif and the text "den 19. Heumonat" in line0001.txt, how do i
>       1. produce the unicharset file --> i used the unicharset_extractor
>       and let it run over all .box files i had and this worked, but is it the
>       right way?
>       2. what do the .box files need to look like? what coordinates do i
>       need to use? after all, it is the whole image
>       3. what data do i use to produce the required files? do i download
>       the *.traineddata from tesseract? do i produce this myself? does
>       tesseract need a text file with all the text from the training data,
>       a wordlist from the training data, and so on?
>
> i'm sorry, i know there is documentation but i find it very confusing. thanks in
> advance for any helpful hints.
>
> best,
>
> phillip
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ecdc6d31-f505-443f-8058-99c6f6670427%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ecdc6d31-f505-443f-8058-99c6f6670427%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
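
On the box coordinates asked about above: Tesseract box files measure from the bottom-left corner of the image, while PAGE-XML points are measured from the top-left, so the y values have to be flipped. Whether that is what triggers the "no box overlapping" warnings here is only a guess. A minimal sketch of the conversion follows, assuming the line-level "WordStr" layout described in the question; the helper names, the page height of 3000 and the exact form of the tab-prefixed end-of-line entry are assumptions, not taken from the Tesseract sources.

```python
def page_points_to_box(points, image_height):
    """Convert a PAGE-XML 'points' string to (left, bottom, right, top).

    PAGE-XML uses a top-left origin; the result uses a bottom-left origin.
    """
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        xs.append(int(x))
        ys.append(int(y))
    left, right = min(xs), max(xs)
    bottom = image_height - max(ys)  # largest y is lowest on the page
    top = image_height - min(ys)     # smallest y is highest on the page
    return left, bottom, right, top


def wordstr_box(text, box, page=0):
    """Format one text line as a WordStr entry plus an end-of-line entry.

    Assumption: the end-of-line entry is a tab followed by the same box.
    """
    left, bottom, right, top = box
    return (
        f"WordStr {left} {bottom} {right} {top} {page} #{text}\n"
        f"\t {left} {bottom} {right} {top} {page}\n"
    )


# Line tl_7 from the question, on a page assumed to be 3000 px high.
page_box = page_points_to_box("1281,594 1734,594 1734,657 1281,657", 3000)
# A cropped line image of 453 x 63 px: the box is simply the whole image.
line_box = (0, 0, 453, 63)
entry = wordstr_box("den 19. Heumonat.", line_box)
```

For an image that already contains exactly one line, the box is the whole image, that is 0 0 <width> <height>, and no PAGE-XML coordinates are needed.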


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
