Hi all, 

Thank you in advance.

I have some questions about improving recognition accuracy by fine-tuning 
the LSTM model. 

*Background:* 

I want to use Tesseract to recognise DNA/RNA sequences in PDF/TIFF files. 
However, the accuracy is poor because the images use different font types 
and sizes.


*Method:*
I understand that I probably have two options:

   1. Run Tesseract on the source images to generate box files, correct 
   them manually with jTessBoxEditor, and train a new 
   eng_dna.traineddata file.
   2. Start from the current eng "best" LSTM traineddata file and 
   fine-tune the network on a set of sequence texts.
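
For option 2, this is a minimal sketch of how I might generate the sequence text, not a tested training recipe. The function name make_training_lines, the blocks of 10 bases, and the leading position number are my own assumptions about the layout; the real text should mimic whatever the source images look like.

```python
import random


def make_training_lines(n_lines, seed=0, min_len=20, max_len=60,
                        alphabet="ACGT", group=10):
    """Return n_lines of synthetic sequence text, one string per line.

    All parameters are illustrative assumptions, not Tesseract settings.
    Use alphabet="ACGU" for RNA.
    """
    rng = random.Random(seed)  # fixed seed so the text is reproducible
    lines = []
    for i in range(n_lines):
        length = rng.randint(min_len, max_len)
        seq = "".join(rng.choice(alphabet) for _ in range(length))
        # Split the sequence into space-separated blocks of `group` bases.
        blocks = [seq[j:j + group] for j in range(0, length, group)]
        # Prefix a position number so that digits also appear in the text.
        position = i * max_len + 1
        lines.append(f"{position} " + " ".join(blocks))
    return lines


if __name__ == "__main__":
    for line in make_training_lines(5):
        print(line)
```

The printed lines could be saved as a training text file and rendered in several fonts (for example with Tesseract's text2image tool) before fine-tuning. Whether that is the right input for my case is exactly what I am unsure about.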

*Questions and concerns:*

   - Can I mix different font types in the training images?
   - Do I need to start from an existing traineddata file? I also want to 
   recognise some normal words and numbers in the DNA/RNA sequence images.
   - I understand that LSTM recognition is line-based. Will it accept 
   mixed-font training images with boxes?
   - Which option is the right one for my problem? I have no experience 
   with training my own model.


Thank you all.
