Thanks, Nick. It is good to have some cube info. Please add the list of languages that use cube mode. I know that Hindi uses option 2, i.e. the combined cube and tess mode.
Regarding neural networks, I have read that nn has been removed from tesseract as it was not open source. That may explain why there is minimal nn code in 3.02. Please see: http://www.cedricve.me/2013/04/12/how-to-train-tesseract/

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 10, 2013 at 10:28 PM, Nick White <[email protected]> wrote:

> Right then, I've created a wiki in Google Code for this collected
> effort.
>
> https://code.google.com/p/tesseract-ocr-extradocs/
>
> I have spent some time this last week reading some of the cube code
> and figuring out the purpose of the various cube training files. I
> still don't know the most interesting stuff, which is exactly how
> the .nn files are used, but it was taking me a while to read the
> code so I thought I'd just post what I have so far.
>
> If anyone wants to add to the wiki let me know and I'll gladly add
> you to the project.
>
> The next thing on my list to document is line segmentation, though I
> should probably try to add more information on how cube works first.
>
> I hope this looks useful to people, and inspires everyone to dig
> into all of the code :)
>
> Nick
>
> On Mon, Jun 03, 2013 at 10:49:46AM -0400, Sven Pedersen wrote:
> > Sounds good. I think we should make some attempt to reverse engineer
> > the Cube engine. I imagine Google will eventually release
> > documentation, but we don't know when; if we document it, they may be
> > more inclined to give their side of it more quickly. It is very
> > possible they don't have much internal documentation anyway.
> > --Sven
> >
> > On Mon, Jun 3, 2013 at 10:25 AM, Nick White <[email protected]> wrote:
> >
> > I wonder, would others here be interested in figuring out and
> > documenting little bits of how the code works?
> >
> > I spent some time in the line segmentation code a little while ago,
> > to figure out better configuration parameters for line segmentation
> > for the Ancient Greek training (which ended up being pretty
> > successful), and I could certainly contribute a partial description
> > of how it works.
> >
> > If others are interested in doing this for key sections (like the
> > parts Dmitri suggested), perhaps we should set up a wiki and get to
> > work? It wouldn't be comprehensive, of course, but sharing what we
> > know could still prove pretty useful.
> >
> > What do people think? Is anyone else interested in doing this?
> >
> > I'll dig out the (very scrappy) notes I made on line segmentation,
> > clean them up, and post them here, when I get time. If anyone else
> > is interested, I'll set up a wiki somewhere.
> >
> > Nick
> >
> > On Thu, May 30, 2013 at 07:32:52PM +0400, Dmitri Silaev wrote:
> > > Excellent post, Nick! The more I read, the more I felt I had to ask
> > > these questions myself, but haven't yet. I'm afraid, though, many of
> > > them would remain unanswered.
> > >
> > > Because after several years of monitoring and asking in this forum I
> > > got used to the feeling that principal developers make only new
> > > release announcements. In the early years, they were much more active
> > > in discussions. I suppose many of the forum questions are tedious to
> > > answer over and over again; the forum search can be used, and many
> > > people are just too lazy to use it. But some of them are not like that
> > > and deserve answers.
> > >
> > > Now it looks like Google is doing us a favor making a formerly
> > > commercial engine open source and sharing its developments from time
> > > to time. The community contribution now is constrained to enhancing
> > > release packages and fixing trivial bugs.
> > > Without proper documentation or at least clues on how all this (not
> > > only Cube) works, developers keep community contribution nominal. I
> > > personally need more info and am ready to contribute, if I begin to
> > > understand the code enough. I used to surf the code alone, but the
> > > potential of this approach is limited. Off the bat, I'm interested in
> > > segmentation, details on the class pruner and integer matcher, a
> > > description of Cube, and best practices for training data generation.
> > > I think there is more to come, once I get more info on these.
> > >
> > > --
> > > Dmitri
> > >
> > > On Thu, May 30, 2013 at 6:48 PM, Nick White <[email protected]> wrote:
> > > > Hi Tesseractors,
> > > >
> > > > I am feeling a bit fed up about the lack of openness with the
> > > > Tesseract project.
> > > >
> > > > The addition of the cube mode, and several trainings, with
> > > > absolutely no documentation, or (as far as I can tell) any tools to
> > > > create cube training files, is a good example of this.
> > > >
> > > > As is the lack of tif/box files for any of the core training files
> > > > in the project.
> > > >
> > > > Keeping the cube tools and documentation private sucks royally. If
> > > > they aren't perfect or polished, it doesn't matter; we could help
> > > > to fix them up!
> > > >
> > > > I suspect some of the tif/box files for training aren't being
> > > > released because of concerns about copyright of the image files. If
> > > > that's the case please work to clear them up, or create freely
> > > > reusable versions.
> > > >
> > > > I love Tesseract; having a very high quality free software OCR
> > > > package is awesome, and I'm very grateful for the amazing work being
> > > > done on it. But I find the lack of parity between those inside
> > > > Google and the wider community to be rather troubling.
> > > >
> > > > If there's anything I can do to help make cube training tools and
> > > > documentation available, or the training source files, I'd be very
> > > > happy to help. Replying offlist if appropriate is fine.
> > > >
> > > > Nick
> > > >
> > > > --
> > > > --
> > > > You received this message because you are subscribed to the Google
> > > > Groups "tesseract-ocr" group.
> > > > To post to this group, send email to [email protected]
> > > > To unsubscribe from this group, send email to
> > > > [email protected]
> > > > For more options, visit this group at
> > > > http://groups.google.com/group/tesseract-ocr?hl=en
> > > >
> > > > ---
> > > > You received this message because you are subscribed to the Google
> > > > Groups "tesseract-ocr" group.
> > > > To unsubscribe from this group and stop receiving emails from it,
> > > > send an email to [email protected].
> > > > For more options, visit https://groups.google.com/groups/opt_out.
> > --
> > ``All that is gold does not glitter,
> > not all those who wander are lost;
> > the old that is strong does not wither,
> > deep roots are not reached by the frost.
> > From the ashes a fire shall be woken,
> > a light from the shadows shall spring;
> > renewed shall be blade that was broken,
> > the crownless again shall be king.”

