Dear all,
Thanks to Matt and Bryan for making me aware of this interesting discussion!
My name is Clemens Neudecker and I have been the Technical Manager of the IMPACT project (www.impact-project.eu). Without going into greater detail about the points that have already been discussed at length, I would nevertheless like to elaborate a bit on the background of IMPACT and on the decisions that led to the use of Aletheia and PAGE in particular. Hopefully this will shed a bit more light on the situation.
At the time when the IMPACT project was conceived (2007), unfortunately neither Tesseract, OCRopus, nor any other open source OCR system was close to delivering competitive results, particularly for historical documents. On the other hand, national and research libraries all over the world were mainly using ABBYY's commercial OCR for their digitization programmes. Thus the decision was made to cooperate with a commercial supplier, as this would guarantee that improvements made in the project would also end up in the real-life production workflows of the museums, libraries and archives sector. Other commercial partners were added to the consortium of research groups and libraries, always based on the assumption that a problem as big as OCR for the wide array of historical documents would be easier to tackle by combining partners and experience from the research and commercial sectors.
Nevertheless we were also closely watching the developments in the community, and I believe we have mentioned on many occasions that Tesseract (and especially release 3.x) has become a serious competitor for commercial solutions, with the benefit of being developed in the open. So within IMPACT we had a situation where a great diversity of software tools for OCR was being worked on in the project: some open source, some commercial, some only available for research, etc.
For all these different modules, proper evaluation needed to be done, which in turn meant that very specific ground truth had to be produced in large quantities and with a very high granularity. This was why Aletheia was built: it aims to address the two main issues with ground truth production in IMPACT:
1) The (various) ground truth data had to be extremely specific - the PAGE format was the only format at the time providing a set of elements out of the box that was rich enough to express all these specific use cases.
2) A system had to be built that would be usable by lay people on a production scale - around 50k pages of ground truth had to be produced in a cost-efficient way using offshore service providers.
More than 50k pages of ground truth have since been produced using Aletheia, and feedback from the production was always integrated into the tool in a timely fashion by the colleagues from PRImA. I am very sorry to hear that so many people have had problems obtaining access - I have informed the people over at PRImA and am confident they will be able to provide you all with the software asap.
Which leads me to the second point I want to make - about open source. In my personal opinion, open source tools and community building are the best strategy for addressing the challenges OCR still has (and there are many) in a sustainable way. In my role as the Technical Manager of IMPACT I was also responsible for the technical integration of the software that was built. We deliberately decided against an integrated software product, which would have had to be commercial due to the integration of intellectual property from commercial companies. Instead we chose to build an interoperable framework based on established open source technologies such as Apache components and Taverna, a well-established open source workflow management tool from the biosciences.
This allowed a loose coupling of commercial and open source tools and a transparent evaluation between them. The system follows standard practices for interoperability and has been implemented using Java and web services for the greatest possible interoperability. It has been developed in the open under an Apache 2.0 license (thus allowing even commercial exploitation). You can find the sources here if you're interested: https://github.com/impactcentre/interoperability-framework.
Next to that, we have also been advocating open source to the various partners that developed software in the IMPACT project. But we also have to respect their institutional policies and intellectual property. However, since the end of the IMPACT project, more and more software tools have been made available with source code.
The main aim of the IMPACT Centre (www.digitisation.eu), a not-for-profit organisation founded to sustain IMPACT outcomes and foster community building, has since been to combine existing and new developments into exactly the open-source, transparent and fully interoperable OCR tool chain that was mentioned earlier. In this function we are also a collaborator in the eMOP effort. However, as everyone who has built open source software in an international collaborative setup will know, this is a long and tedious process, and we are still very busy making tools ready for release.
To give you a quick account of what is currently available and in the pipeline:
- the interoperable framework mentioned above: https://github.com/impactcentre/interoperability-framework
- the inventory extraction (clustering) tool from IMPACT: https://github.com/impactcentre/inventory-extraction
- one of the post-correction tools from IMPACT: https://github.com/impactcentre/PoCoTo
- an OCR evaluation tool with support for PAGE, but also hOCR and other formats: https://github.com/impactcentre/ocrevalUAtion
- a retrieval system that can leverage dictionaries and language technologies built in IMPACT: https://github.com/INL/BlackLab
Further modules from IMPACT that will be released as open source within Q1-2/2014:
- a Java tool for training Tesseract based on PAGE XML instances, including some basic classes to operate on PAGE (I am currently beta testing this; it will remove some of the dependencies on Aletheia in the eMOP process)
- more tools for building OCR lexica and historical dictionaries
Watch this space at https://github.com/impactcentre and http://www.digitisation.eu/tools/ to hear of these tools being made available!
Also, from what I understand, one of the outcomes of eMOP will be a web-based and open-source (albeit feature-reduced) version of Aletheia. Besides, the XSD for PAGE is available, and basic Java classes for working with PAGE, for example, can in principle be generated automatically from it.
Regarding ground truth, a couple of IMPACT datasets with ground truth have already been released (see www.digitisation.eu/data/), and more can be expected to follow in the course of 2014. I believe that the 50k pages of ground truth for historical documents in PAGE format, once released in their entirety, will be one of the biggest assets of PAGE and Aletheia.
With more of that ground truth being released and produced, and more tools being made available that can operate on the PAGE format, I hope this will create some momentum in the OCR community, also beyond the former IMPACT consortium partners. Within the IMPACT Centre we are very busy making all these resources available, and we would very much encourage everyone here to get involved with the Centre (you can register for free as a user) and to be in touch about the needs, concerns and expectations of the community towards IMPACT.
I personally have been involved with OCR technology for more than a decade, and it is close to my heart. As Nick mentioned, the community is not very large, and having been at ICDAR and other events, I can say there is still room for improvement with regard to the sharing of results, methods and implementations. The IMPACT Centre aims to provide a sustainable infrastructure where such community collaboration can develop - but for that we also need YOUR help and input.
Very much looking forward to being in touch, here or over at www.digitisation.eu,
Best regards,
Clemens
On Thursday, December 12, 2013 6:02:08 AM UTC+1, jsbien wrote:
>
> Dear Matthew, thank you for your long letter.
>
> To make a long story short, I'm familiar with the old typography problems but I have no experience with tesseract training.
>
> I may however point you to the report concerning an experiment consisting in training tesseract on old Polish texts with the same problems which you describe:
>
> http://lib.psnc.pl/publication/428
>
> Both the texts, as images and PAGE files, are publicly available at
>
> http://dl.psnc.pl/activities/projekty/impact/results/
>
> Please note that the trained dataset is also available at
>
> http://dl.psnc.pl/download/tesseract_traineddata.zip
>
> The training used "classical" rectangular method.
>
> To say the truth, I don't know how efficient the training was as I'm not aware of any large scale application of the trained dataset. Using it is one of the user options at Virtual Transcription Laboratory (http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who uses it and for what.
>
> It would be interesting to retrain tesseract using your approach on the data described above and to compare the results, but I'm afraid nobody has time and motivation for it.
>
> Best regards and good luck with your project
>
> Janusz
>
> --
> Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
>
--
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en