Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Clemens Neudecker Thu, 12 Dec 2013 04:48:45 -0800

Dear all,

Thanks to Matt and Bryan for making me aware of this interesting discussion!

My name is Clemens Neudecker and I have been the Technical Manager of the 
IMPACT project (www.impact-project.eu). Without going into greater detail 
about the points that have already been discussed at length, I would 
nevertheless like to elaborate a bit on the background of IMPACT and what 
decisions led to the use of Aletheia and PAGE in particular. Hopefully this 
will shed a bit more light on the situation.

At the time when the IMPACT project was conceived (2007), unfortunately 
neither Tesseract, Ocropus nor any other open source OCR system was close to 
delivering competitive results, in particular for historical documents. On 
the other hand, national and research libraries all over the world were 
mainly using Abbyy's commercial OCR for their digitization programmes. Thus 
the decision was made to cooperate with a commercial supplier as this would 
guarantee that improvements that would be made in the project would also 
end up in the real-life production workflows of the museums, libraries and 
archives sector. Other commercial partners were added to the consortium of 
research groups and libraries, always based on the assumption that a 
problem as big as OCR for the wide array of historical documents would be 
easier to tackle when combining research and commercial sector partners and 
experience. Nevertheless we were also closely watching the developments in 
the community, and I believe we have not failed to mention on many 
occasions the fact that Tesseract (and especially release 3.x) has become 
a serious competitor for commercial solutions, with the benefit of being 
developed in the open.

So within IMPACT we had a situation where there was a great diversity of 
software tools for OCR that were being worked on in the project, some open 
source, some commercial, some only available for research, etc. For all 
these different modules, proper evaluation needed to be done, which in turn 
meant that very specific ground truth had to be produced in large 
quantities and with a very high granularity. This was why Aletheia was 
built: it aims to address the two main issues with the ground truth 
production in IMPACT:

1) The (various) ground truth data had to be extremely specific - the PAGE 
format was the only format at the time providing a set of elements out of 
the box that was rich enough to express all these specific use cases.

2) A system had to be built that would be useable by lay people on a 
production scale - around 50k pages of ground truth had to be produced in a 
cost-efficient way using offshore service providers.

More than 50k pages of ground truth have since been produced using 
Aletheia, and feedback from the production was always integrated into the 
tool in a timely fashion by the colleagues from PRImA. I am very sorry to 
hear that so many people obviously have had problems obtaining access - I 
have informed the people over at PRImA and am confident they will be able 
to provide you all with the software asap. 

Which leads me to the second point I want to make - about open source. In 
my personal opinion, open source tools and community building are the best 
strategy for addressing the challenges OCR still has (and there are many) 
in a sustainable way. In my role as the Technical Manager of IMPACT I was 
also responsible for the technical integration of the software that was 
built. We deliberately decided against an integrated software product 
that would have had to be commercial due to the integration of intellectual 
property from commercial companies, and instead built an 
interoperable framework based on established open source technologies such 
as Apache components and Taverna, a well-established open source workflow 
management tool from the bio sciences. This allowed a loose coupling of 
commercial and open source tools and a transparent evaluation between them. 
The system follows standard practices for interoperability and has been 
implemented using Java and web services for the greatest possible 
interoperability. It has been developed in the open under an Apache 2.0 
license (thus allowing even commercial exploitation). You can find the 
sources here if you're interested: 
https://github.com/impactcentre/interoperability-framework.

Next to that, we have also been advocating open source to the various 
partners that developed software in the IMPACT project. But we also have to 
respect their institutional policies and intellectual property. However, 
since the end of the IMPACT project, more and more software tools have 
continuously been made available with source code. The main aim of the 
IMPACT Centre (www.digitisation.eu), a not-for-profit organisation founded 
to sustain IMPACT outcomes and foster community building, has since been to 
combine existing and new developments into the exact open-source, 
transparent and fully interoperable OCR tool chain that was mentioned 
earlier. In this function we are also a collaborator in the eMOP effort. 
However, as everyone who has been building open source software in an 
international collaborative setup will know, this is a long and tedious 
process and we are still very much busy with making tools ready for 
release. To give you a quick account of what is currently available and in 
the pipeline:

- the interoperable framework mentioned above: 
https://github.com/impactcentre/interoperability-framework
- the inventory extraction (clustering) tool from IMPACT: 
https://github.com/impactcentre/inventory-extraction
- one of the post-correction tools from IMPACT: 
https://github.com/impactcentre/PoCoTo
- an OCR evaluation tool with support for PAGE, but also hOCR and other 
formats: https://github.com/impactcentre/ocrevalUAtion
- a retrieval system that can leverage dictionaries and language 
technologies built in IMPACT: https://github.com/INL/BlackLab

Further modules from IMPACT that will be released as open source within 
Q1-2/2014:

- a Java tool for training Tesseract based on PAGE xml instances, including 
some basic classes to operate on PAGE
(I am currently beta testing this, it will remove some of the dependencies 
on Aletheia in the eMOP process)
- more tools for building OCR lexica and historical dictionaries

Watch the space at https://github.com/impactcentre and 
http://www.digitisation.eu/tools/ in order to hear of all these tools being 
made available! Also, from what I understand, one of the outcomes of eMOP 
will be a web-based and open-source (although feature-reduced) version of 
Aletheia. Besides, the XSD for PAGE is available and e.g. basic Java 
classes for working with PAGE can in principle be automatically generated 
from that.
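
To make the point about operating on PAGE a little more concrete, here is a 
minimal sketch of reading region-level text from a PAGE-like instance. It is 
not one of the IMPACT tools and not the generated Java classes mentioned 
above; Python's standard XML parser is used only for brevity. The sample 
document is a hypothetical, heavily simplified instance, and the element and 
attribute names (TextRegion, TextEquiv, Unicode, id) as well as the namespace 
are assumptions that should be checked against the XSD version in use. The 
code matches on local element names so that it does not depend on one 
particular schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE-like instance for illustration only.
# Real PAGE files carry more metadata, and the namespace differs
# between schema versions.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,80 10,80"/>
      <TextEquiv><Unicode>First region</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="10,100 500,100 500,170 10,170"/>
      <TextEquiv><Unicode>Second region</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def region_texts(xml_text):
    """Map each TextRegion id to its region-level transcription.

    Only the TextEquiv that is a direct child of the region is read,
    so transcriptions nested in lines or words are ignored.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter():
        if local_name(elem.tag) != "TextRegion":
            continue
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    result[elem.get("id")] = sub.text or ""
    return result


if __name__ == "__main__":
    for region_id, text in region_texts(SAMPLE).items():
        print(region_id, text)
```

Classes generated from the XSD would give typed access to the same 
structure; the sketch above only shows how little is needed to get at the 
transcriptions.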

Regarding ground truth, while a couple of IMPACT datasets with ground truth 
have already been released (see www.digitisation.eu/data/), here too more 
can be expected to follow in the course of 2014. I believe that, once 
released in its entirety, the availability of 50k pages of ground truth for 
historical documents in PAGE format is one of the biggest assets of PAGE 
and Aletheia. With more of that ground truth being released and produced, 
and more tools being made available that can operate on the PAGE format, I 
hope this will create some momentum in the OCR community also beyond former 
IMPACT consortium partners.

Within the IMPACT Centre we are very much busy with making all these 
resources available, and we would very much encourage everyone here to get 
involved with the Centre (you can register for free as a user), and be in 
touch about the needs, concerns and expectations of the community towards 
IMPACT. I personally have been involved with OCR technology for more than a 
decade, and it is close to my heart. As Nick mentioned, the community is 
not very large, and having been at ICDAR and other events, I think there is 
still room for improvement with regard to sharing of results, methods and 
implementations. The IMPACT Centre aims to provide a sustainable 
infrastructure where such community collaboration can develop - but for 
that we also need YOUR help and input.

Very much looking forward to being in touch, here or over at 
www.digitisation.eu,

Best regards,
Clemens

On Thursday, December 12, 2013 6:02:08 AM UTC+1, jsbien wrote:
>
> Dear Matthew, thank you  for your long letter. 
>
> To make a long story short, I'm familiar with the old typography   
> problems but I have no experience with tesseract training. 
>
> I may however point you to the report concerning an experiment   
> consisting in training tesseract on old Polish texts with the same   
> problems which you describe: 
>
> http://lib.psnc.pl/publication/428 
>
> Both the texts, as images and PAGE files, are publicly available at 
>
> http://dl.psnc.pl/activities/projekty/impact/results/ 
>
> Please note that the trained dataset is also available at 
>
> http://dl.psnc.pl/download/tesseract_traineddata.zip 
>
> The training used "classical" rectangular method. 
>
> To say the truth, I don't know how efficient the training was as I'm   
> not aware of any large scale application of the trained dataset. Using   
> it is one of the user options at Virtual Transcription Laboratory   
> (http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who   
> uses it and for what. 
>
> It would be interesting to retrain tesseract using your approach on   
> the data described above and to compare the results, but I'm afraid   
> nobody has time and motivation for it. 
>
> Best regards and good luck with your project 
>
> Janusz 
>
>
> -- 
> Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra   
> Lingwistyki Formalnej) 
> Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics 
> Department) 
> [email protected], [email protected], 
> http://fleksem.klf.uw.edu.pl/~jsbien/ 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

