I recently had to edit a few Tesseract box files to generate training data, 
and none of the existing tools 
<https://code.google.com/p/tesseract-ocr/wiki/AddOns#Box_file_editors> were 
to my liking. I wanted something that ran on Mac OS X and showed letters 
inside their boxes.

So I built a web-based tool which I'm calling boxedit.

Here's the tool: http://www.danvk.org/boxedit/
Demo with preloaded data: http://www.danvk.org/boxedit/demo.html
Source code & instructions: https://github.com/danvk/boxedit/

A few things to like about it:
- It's entirely browser-based, so it runs on any platform and requires no 
installation.
- You can use the browser's zoom in/out features.
- It shows OCR'd letters on top of the source image, so the accuracy is 
easy to gauge.
- It can split boxes N ways.
- You can edit the raw box data or use the GUI; either works, and the two 
stay in sync.
- It's easy to get going: drag & drop an image and its box file to get 
started.
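
For anyone who hasn't looked inside a box file: each line describes one 
symbol as "<symbol> <left> <bottom> <right> <top> <page>", with pixel 
coordinates measured from the bottom-left corner of the image. Below is a 
rough Python sketch of what an N-way split amounts to on that data. It is 
not taken from boxedit's source; the names (Box, parse_box_line, split_box, 
format_box) are made up for illustration, and it assumes the six-field line 
format and an even horizontal split.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the symbol field is kept intact.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def format_box(box: Box) -> str:
    return f"{box.char} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


def split_box(box: Box, n: int) -> List[Box]:
    """Split a box into n side-by-side boxes of (nearly) equal width.

    Every piece keeps the original symbol; the user is expected to
    correct the symbols afterwards.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    width = box.right - box.left
    # Integer edges so the pieces tile the original box with no gaps.
    edges = [box.left + (width * i) // n for i in range(n + 1)]
    return [box._replace(left=edges[i], right=edges[i + 1]) for i in range(n)]


if __name__ == "__main__":
    for piece in split_box(parse_box_line("m 10 20 40 50 0"), 3):
        print(format_box(piece))
```

Computing the edges up front, rather than adding a fixed width repeatedly, 
keeps rounding errors from accumulating when the width is not divisible by N.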

A few things to dislike:
- The UI could use some work: the overlaying of transcribed letters could 
be much clearer.
- Saving your changes back to disk is tedious (my best solution is to 
copy/paste back into the box file).
- It's missing a few important features (e.g. n-way merge and 
moving/resizing boxes visually).
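
To make the n-way merge idea concrete, here is a small Python sketch of one 
way it could work on box data: take the union of the bounding boxes and 
concatenate the symbols in order. This is only an illustration under 
assumptions, not something boxedit does today; Box and merge_boxes are 
hypothetical names, and the sketch assumes the usual box-file fields (symbol, 
left, bottom, right, top, page) with the origin at the bottom-left.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def merge_boxes(boxes: Sequence[Box]) -> Box:
    """Merge boxes into one covering all of them (assumed behaviour)."""
    if not boxes:
        raise ValueError("need at least one box to merge")
    if len({b.page for b in boxes}) != 1:
        raise ValueError("cannot merge boxes from different pages")
    return Box(
        # Symbols are joined in the order given; the user can fix them up.
        char="".join(b.char for b in boxes),
        left=min(b.left for b in boxes),
        # The origin is bottom-left, so bottom is the smaller y value.
        bottom=min(b.bottom for b in boxes),
        right=max(b.right for b in boxes),
        top=max(b.top for b in boxes),
        page=boxes[0].page,
    )


if __name__ == "__main__":
    print(merge_boxes([Box("r", 10, 20, 18, 40, 0), Box("n", 18, 19, 30, 41, 0)]))
```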

If people find this useful, I'm happy to polish it a bit more. Feel free to 
file issues <https://github.com/danvk/boxedit/issues> on GitHub.

  - Dan
