Dear Friends,
I am using tesseract-3.02 to extract text from scanned images.
As you know, Tesseract can be integrated into an application. Internally
it uses a number of training data files, config files and so on, which are
installed in the tessdata folder.
When I listed these files, I found that there are at least 29 files:
1. eng.traineddata
2. eng.cube.bigrams
3. eng.cube.fold
4. eng.cube.size
5. eng.cube.nn
6. eng.cube.params
7. eng.cube.word-freq
8. eng.tesseract_cube.nn
9. eng.cube.lm
10. osd.traineddata
11. logfile
12. api_config
13. box.train
14. box.train.stderr
15. digits
16. hocr
17. inter
18. linebox
19. ambigs.train
20. makebox
21. rebox
22. strokewidth
23. unlv
24. batch.nochop
25. matdemo
26. msdemo
27. nobatch
28. segdemo
29. batch
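In case it helps to reproduce the listing above together with file sizes, here is a minimal Python sketch. It uses only the standard library and does not call Tesseract at all. The function names list_tessdata and total_size are my own (hypothetical), and the tessdata location has to be supplied by the caller since it depends on the installation.

```python
import os


def list_tessdata(tessdata_dir):
    """Return a sorted list of (relative_path, size_in_bytes) tuples for
    every regular file found under tessdata_dir, including subfolders."""
    entries = []
    for root, _dirs, files in os.walk(tessdata_dir):
        for name in files:
            full_path = os.path.join(root, name)
            rel_path = os.path.relpath(full_path, tessdata_dir)
            entries.append((rel_path, os.path.getsize(full_path)))
    return sorted(entries)


def total_size(entries):
    """Sum the sizes (in bytes) of the entries returned by list_tessdata."""
    return sum(size for _rel_path, size in entries)
```

Calling list_tessdata with the path of the tessdata folder and printing each entry gives the file list, and total_size shows how much disk space the folder currently takes.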
Could you please help me with the following questions:
1) Are all these files needed when tesseract is doing OCR extraction?
-- If not, what are the minimum mandatory files required by tesseract to
work correctly?
2) Can we combine these mandatory files into one file and use it with
tesseract without unpacking it?
-- I have a disk space constraint and also want to reduce the number of
reads from the disk.
Many thanks in advance for your guidance and time.
Best Regards,
- ganesh