[tesseract-ocr] Questions regarding fine tuning of Tesseract 4.00alpha LSTM

Wang Zhimin Mon, 13 Nov 2017 01:04:42 -0800

Hi all, 

Thank you in advance.

I have some questions about improving accuracy by fine-tuning the LSTM
model.

*Background:*

I want to use Tesseract to recognise DNA/RNA sequences in PDF/TIFF files.
However, the accuracy is not great because the images use different font
types and sizes.

*Method:*
I understand that I probably have two options:

1. Run Tesseract on the source images to generate box files, correct them
manually with jTessBoxEditor, and train a new eng_dna.traineddata file.
2. Start from the current eng "best" LSTM traineddata file and fine-tune
the network with a set of sequence texts.
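
For option 2, the sequence texts could be generated rather than collected by hand. Below is a minimal sketch in Python, assuming random DNA/RNA lines with a leading number are a reasonable stand-in for the real data. The function names, the line layout, and the output file name are hypothetical, and the step that renders this text and feeds it to the Tesseract training tools is not shown.

```python
import random

DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def random_sequence(rng, alphabet, length):
    """Return a random string of `length` characters drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_training_lines(n_lines, seed=0, min_len=20, max_len=60):
    """Build `n_lines` text lines, alternating DNA and RNA sequences.

    Each line starts with a number, because the images also contain
    numbers next to the sequences (an assumed layout, not a Tesseract rule).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for i in range(n_lines):
        alphabet = DNA_ALPHABET if i % 2 == 0 else RNA_ALPHABET
        length = rng.randint(min_len, max_len)
        position = rng.randint(1, 9999)
        lines.append(f"{position} {random_sequence(rng, alphabet, length)}")
    return lines


def write_training_text(path, lines):
    """Write one training line per row to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

Calling write_training_text("dna.training_text", make_training_lines(1000)) would produce a file with 1000 lines. Real sequences and real surrounding words would probably be better training material than random ones, if they are available.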

*Questions and concerns:*

- Can I mix different font types in the training images?
- Do I need to start from an existing traineddata file? I also want to
recognise some normal words and numbers in the DNA/RNA sequence images.
- I understand that LSTM recognition is line-based. Will it accept
mixed-font training images with boxes?
- Which option is the right one for my problem? I have no experience
with training my own model.

Thank you all.
