Hello, all!

I'm currently talking with a group of MuckRock users about automatically 
OCR'ing a very large set (tens of millions) of CIA documents.

It looks like scanning the full set would take many months on a single 
machine, but I think it could happen in far less time if done in parallel 
on AWS Lambda (or similar) or on an elastic cluster.

It will take a little bit of work to design and build this architecture 
(two architectures, in fact: one optimized for speed and one optimized for 
cost), so I think it would be nice if we could build out this system in a 
way that would benefit the larger community. Therefore, I'd like to float 
the proposal that we start a new Free and Open Source software project for 
tools, templates and guides to build queue-based elastic and serverless 
Tesseract systems that can quickly and affordably scan millions of 
documents in the cloud.
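
To make the queue-based idea a bit more concrete, here is a minimal sketch 
of the worker pattern, not a proposed design. It uses an in-process queue 
and threads as stand-ins for a hosted queue and elastic workers. The names 
`ocr_with_tesseract` and `run_workers` are hypothetical, and the sketch 
assumes the `tesseract` command-line binary is installed and on PATH.

```python
import queue
import subprocess
import threading
from typing import Callable, Dict, Iterable, Tuple


def ocr_with_tesseract(image_path: str) -> str:
    """Run the tesseract CLI on one image and return the recognized text.

    Assumption: the `tesseract` binary is on PATH. Passing `stdout` as the
    output base makes tesseract print the text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def run_workers(
    paths: Iterable[str],
    ocr: Callable[[str], str] = ocr_with_tesseract,
    num_workers: int = 4,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Drain a queue of document paths with a pool of worker threads.

    Returns (results, errors), each keyed by document path. In a real
    deployment the in-process queue would be a hosted queue and each
    worker would be a separate function invocation or cluster node.
    """
    jobs: "queue.Queue[str]" = queue.Queue()
    for path in paths:
        jobs.put(path)

    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                text = ocr(path)
                with lock:
                    results[path] = text
            except Exception as exc:
                # Record the failure so one bad document does not stop the run.
                with lock:
                    errors[path] = str(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, errors
```

The OCR step is passed in as a function so the same loop can be exercised 
without Tesseract installed, and so a speed-optimized and a cost-optimized 
setup could share the worker logic and differ only in how workers are 
launched.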

Would anybody on this list be interested in working on something like this?

Even more specifically: since Google maintains ownership of the Tesseract 
project and also owns the Google Cloud Platform, would Google be willing 
to devote some resources to sponsoring the creation of this project if it 
were designed to run on the Google Cloud (GCE/GCF) using Google 
technologies (k8s)? If not, does anybody know of any other organizations 
that would be interested in throwing some resources at this?

It's just an idea, but it's something that I'd like to work on if the 
resources are available, and I think it would have a very large impact on 
a number of different communities.

Thanks for your consideration and feedback,
Rich Jones
https://github.com/Miserlou
