Hi all, thank you in advance.
I have some questions about improving accuracy by fine-tuning the LSTM model.

*BACKGROUND:* I want to use Tesseract to recognise DNA/RNA sequences from PDF/TIFF files. However, the accuracy is poor because the images use different font types and sizes.

*Method:* As I understand it, I have two options:

1. Run Tesseract on the source images to generate box files, correct them manually with jTessBoxEditor, and train a new eng_dna.traineddata file.
2. Take the current eng best LSTM traineddata file and fine-tune the network with a set of sequence texts.

*Questions and concerns:*

- Can I mix different font types in the training images?
- Do I need to start from an existing traineddata file? I also want to recognise some normal words and numbers in the DNA/RNA sequence images.
- I understand that LSTM recognition is line-based. Will it accept training images with mixed fonts and box files?
- Which of the two options is the right one for my problem?

I have no experience with training my own model, so any pointers are appreciated.

Thank you all.
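
P.S. For option 2, this is roughly how I imagine producing the "bunch of sequence texts". It is only a minimal sketch: the function names, the line lengths, and the grouping into blocks of 10 letters are my own assumptions for illustration, and nothing in it is part of Tesseract or its training tools. It just generates random DNA/RNA lines that could be saved as plain text and rendered in several fonts.

```python
import random

# Alphabets for the two sequence types mentioned above.
DNA_ALPHABET = "ACGT"
RNA_ALPHABET = "ACGU"


def random_sequence(length, alphabet, rng):
    """Return a random string of `length` letters drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def make_training_lines(n_lines, min_len=20, max_len=60, group=10, seed=0):
    """Build `n_lines` text lines, alternating DNA and RNA sequences.

    Each sequence is split into space-separated blocks of `group` letters
    (an assumed layout; adjust it to match the real documents).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for i in range(n_lines):
        alphabet = DNA_ALPHABET if i % 2 == 0 else RNA_ALPHABET
        length = rng.randint(min_len, max_len)
        seq = random_sequence(length, alphabet, rng)
        blocks = [seq[j:j + group] for j in range(0, len(seq), group)]
        lines.append(" ".join(blocks))
    return lines


if __name__ == "__main__":
    for line in make_training_lines(5):
        print(line)
```

Since I also need normal words and numbers to be recognised, I assume the real training text would have to mix lines like these with ordinary English lines rather than consist of sequences only.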

