I wonder, would others here be interested in figuring out and documenting little bits of how the code works?
I spent some time in the line segmentation code a little while ago, to figure out better configuration parameters for line segmentation for the Ancient Greek training (which ended up being pretty successful), and I could certainly contribute a partial description of how it works. If others are interested in doing this for key sections (like the parts Dmitri suggested), perhaps we should set up a wiki and get to work? It wouldn't be comprehensive, of course, but sharing what we know could still prove pretty useful. What do people think? Is anyone else interested in doing this?

I'll dig out the (very scrappy) notes I made on line segmentation, clean them up, and post them here, when I get time. If anyone else is interested, I'll set up a wiki somewhere.

Nick

On Thu, May 30, 2013 at 07:32:52PM +0400, Dmitri Silaev wrote:
> Excellent post, Nick! The more I read, the more I felt I had to ask
> these questions myself, but didn't yet. I'm afraid, though, many of
> them would remain unanswered.
>
> Because after several years of monitoring and asking in this forum I
> got used to the feeling that principal developers make only new
> release announcements. In the early years, they were much more active
> in discussions. I can suppose many of forum questions are tedious to
> answer over and over again, the forum search can be used, and many
> people just feel lazy to use it. But some of them are not like that
> and deserve answers.
>
> Now it looks like Google is doing us a favor making a formerly
> commercial engine outsource and sharing its developments from time to
> time. The community contribution now is constrained by enhancing
> release packages and fixing trivial bugs. Without a proper
> documentation or at least clues on how all this (not only Cube) works,
> developers keep community contribution nominal. I personally need more
> info and am ready to contribute, if I begin to understand the code
> enough.
> I used to surf the code alone, but the potential of this
> approach is limited. Off the bat, I'm interested in segmentation,
> details on class pruner and integer matcher, description of Cube, best
> practices on training data generation. I think, there are more to
> come, once I get more info on these.
>
> --
> Dmitri
>
>
> On Thu, May 30, 2013 at 6:48 PM, Nick White <[email protected]> wrote:
> > Hi Tesseractors,
> >
> > I am feeling a bit fed up about the lack of openness with the
> > Tesseract project.
> >
> > The addition of the cube mode, and several trainings, with
> > absolutely no documentation, or (as far as I can tell) any tools to
> > create cube training files, is a good example of this.
> >
> > As is the lack of tif/box files for any of the core training files
> > in the project.
> >
> > Keeping the cube tools and documentation private sucks royally. If
> > they aren't perfect or polished, it doesn't matter; we could help
> > to fix them up!
> >
> > I suspect some of the tif/box files for training aren't being
> > released because of concerns about copyright of the image files. If
> > that's the case please work to clear them up, or create freely
> > reusable versions.
> >
> > I love Tesseract; having a very high quality free software OCR
> > package is awesome, and I'm very grateful for the amazing work being
> > done on it. But I find the lack of parity between those inside
> > Google and the wider community to be rather troubling.
> >
> > If there's anything I can do to help make cube training tools and
> > documentation available, or the training source files, I'd be very
> > happy to help. Replying offlist if appropriate is fine.
> >
> > Nick
> >
> > --
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
> >
> > ---
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to [email protected].
> > For more options, visit https://groups.google.com/groups/opt_out.

