Hi,
I do not think Tesseract's page segmentation can handle this kind of layout.
It is more oriented towards paragraphs, tables and classic text layouts. And
I think page segmentation is not based on neural networks.

I would try something like the OpenCV EAST text detector
<https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/>
in this case, or try to detect, with custom code, the white
regions of the balloons (something like this
<https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/>
).
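
To make the "white regions" idea concrete, here is a minimal sketch in plain
Python. It is only an illustration, not a tested recipe for manga pages: the
function name find_white_regions, the threshold of 230 and the min_area value
are my own assumptions, and the flood fill is a dependency-free stand-in for
the connected-component or contour routines that OpenCV provides. It treats
the page as a grid of grayscale values, groups bright pixels that touch each
other, and returns one bounding box per group.

```python
from collections import deque


def find_white_regions(gray, white_threshold=230, min_area=6):
    """Return inclusive bounding boxes (x0, y0, x1, y1) of bright regions.

    gray is a list of rows, each a list of 0-255 grayscale values.
    white_threshold and min_area are illustrative guesses, not tuned values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] < white_threshold:
                continue
            # Breadth-first flood fill over 4-connected bright pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and gray[ny][nx] >= white_threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # Drop specks that are too small to be a balloon.
            if area >= min_area:
                boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    # Tiny synthetic "page": dark artwork with one bright balloon.
    page = [[40] * 8 for _ in range(6)]
    for row in range(1, 4):
        for col in range(2, 6):
            page[row][col] = 255
    page[2][3] = 0    # a dark "letter" pixel inside the balloon
    page[5][7] = 255  # isolated bright speck, smaller than min_area
    print(find_white_regions(page))
```

Dark letters inside a balloon do not break the box, because the bounding box
spans the whole bright region around them. On a real page the white margins
and gutters would also show up as large regions, so some extra filtering by
size or shape would be needed before cropping each box and passing it to
Tesseract.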

Also, the training document you are referring to is for Tesseract 3.x.
Training with 4.x is easier and there is no need to draw boxes. Again, this
training has nothing to do with page segmentation (AFAIK).


Bye

Lorenzo




On Thu, 23 May 2019 at 07:12, Krzysztof Studnicki <
[email protected]> wrote:

> Hello!
> I'm trying to train Tesseract to give me text from manga pages.
> So far I have mixed results. I've tried using the stock .traineddata file and
> self-made ones, but accuracy is similar (I have only trained it with a couple
> of pages, I know it's not enough).
> When I tried to get text from a whole page, it recognized many of the
> words, but there were many more random characters in between (it recognized
> letters from drawings).
> A much better result comes from a cropped cloud, almost 83% accuracy, but the
> best is when only the text is cropped with a little white border around it:
> 94%.
>
> Is it possible to teach Tesseract to recognize text on such pages? I was
> thinking about preparing a dozen such pages with corresponding box files
> and using the process explained here:
> <http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/>
> I thought about doing more work by using some other software to recognize and
> crop text clouds, but I feel it kind of defeats the purpose of using the full
> potential of Tesseract's neural network.
> The question is whether it can be taught to search for text in a sea of
> drawings, and how. So far I'm going in circles and seeing no end...
>
> I have included an image of part of a page that was processed using the
> command "tesseract IMAGE_NAME BOX_FILE_NAME batch.nochop makebox".
> To check accuracy (and correct errors) I'm using QT Box Editor.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/03065e53-b571-461a-9b61-ca330d4b32b6%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/03065e53-b571-461a-9b61-ca330d4b32b6%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
