Anybody available for contract OCR work?

David Barrett Sun, 25 Sep 2011 00:20:26 -0700

Hi!  I'm looking to hire some OCR contractors. Are you interested?
Any code changes you make to Tesseract can be freely contributed back
to the mainline, so any Tesseract contributors looking to get paid to
develop the platform -- come talk to me.  More information below.
Thanks!

-david

http://www.elance.com/j/read-text-from-mobile-camera-images-serverside-ocr-of-mobile-images/26253541/

Read text from mobile camera images (serverside OCR of mobile images)

Hi! Expensify processes hundreds of thousands of receipt images every
month. We have a proprietary scanning system in place that reads
receipt images, but we're not scanning experts and are eager for
someone with more experience to take another look. The project is:

1) We'll give you a large corpus of receipt images taken under a wide
variety of lighting conditions, across every mobile device under the
sun.

2) You create a PHP function (probably calling out to some native
binary or library) that can execute on 64-bit Ubuntu 10.04. It accepts
one parameter -- the filename of a JPG image on disk.

3) This function returns the text obtained from the receipt image as a
simple unstructured string -- just all the numbers and letters read
off of it. (The eventual goal is to "structure" the response to
identify what is the merchant, amount, and date of the purchase, but
scanning *anything* is the start.)
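
For illustration only, here is a minimal sketch of that contract, written in Python rather than the PHP the posting asks for. It assumes a Tesseract command-line binary is installed and that it prints recognized text to standard output when given `stdout` as the output base. The function name `read_receipt_text` and the injectable `run` parameter are hypothetical, not part of the posting.

```python
import subprocess


def read_receipt_text(image_path, run=subprocess.run):
    """Return all text read from the image at image_path as one plain string.

    Assumption: a `tesseract` binary is on PATH and writes recognized text
    to standard output when `stdout` is passed as the output base.
    `run` is injectable so the wrapper can be exercised without the binary.
    """
    result = run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Collapse all whitespace so callers get a simple unstructured string.
    return " ".join(result.stdout.split())
```

Passing a stand-in for `run` makes it possible to check the wrapper on a machine that has no OCR engine installed.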

I *imagine* you'll want to do something like:

1) Edge detect to find the receipt
2) De-warp into a receipt shape
3) Reduce to grayscale or b/w (whitebalanced color is nice, but not
necessary)
4) Orient correctly (or just scan 4 times, once for each orientation,
and take the best result)
5) OCR
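
As a rough sketch of steps 3 and 4 only, the Python below reduces an image to black and white and tries all four orientations. It represents an image as plain lists of pixels so it stays self-contained. The fixed threshold of 128, the function names, and the idea of passing in an `ocr` callable and a `score` callable are all assumptions made for illustration; a real pipeline would likely use an imaging library and an adaptive threshold.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    # ITU-R BT.601 luma weights, a common choice for grayscale conversion.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def to_black_and_white(gray_rows, threshold=128):
    """Map each gray pixel to 0 (black) or 255 (white) with a fixed cutoff."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray_rows]


def rotate_90_clockwise(rows):
    """Rotate a rectangular grid of pixels a quarter turn clockwise."""
    return [list(col) for col in zip(*rows[::-1])]


def best_orientation(rows, ocr, score):
    """Run ocr on all four rotations; return (text, degrees) for the top score.

    ocr:   hypothetical callable taking a pixel grid and returning text.
    score: hypothetical callable rating how plausible that text looks.
    """
    best_text, best_degrees, best_score = None, 0, float("-inf")
    current = rows
    for degrees in (0, 90, 180, 270):
        text = ocr(current)
        s = score(text)
        if s > best_score:
            best_text, best_degrees, best_score = text, degrees, s
        current = rotate_90_clockwise(current)
    return best_text, best_degrees
```

One simple `score` would count the alphanumeric characters in the returned text, on the guess that a wrongly oriented receipt yields mostly noise.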

However, that's just a suggestion -- how you make this happen is
entirely up to you. It can involve open source libraries, licensing
proprietary libraries, writing new code from scratch, whatever.
Ideally we'd do this at zero marginal cost (eg, if we do license
anything it'd be for a fixed cost), but marginal cost is acceptable if
that's what it takes.

Please include in your bid the following milestones:

0) A general plan of attack. List what libraries you intend to use,
what experience you've had in the past doing this sort of work, what
you've found works well in general and what problems you've
encountered, etc. Don't worry about us stealing your ideas: if your
ideas are the best then we're obviously going to go with you. This is
a very, very hard problem and we're genuinely looking for a long-term
relationship (contract at the very least, possibly full-time
contracting, or even permanent hire if you're open). This is where you
demonstrate your a priori knowledge of this space.

1) A first fixed bid (the one you tell Elance) for the "image cleanup"
portion (eg, steps 1-3 above). This is pretty self-contained and
straightforward, so it should be possible to estimate its cost in a
fixed manner. We'll provide you with 100 real-world receipt images (or
more if you need): the success metric will be at least 95% success in
cleaning up these images for OCR, as judged through visual inspection.

2) A second fixed bid (just include in the comments) for the "first
pass" of your OCR approach. In essence, do everything you already know
how to do -- apply every trick in your book that you've already tried
in the past. This isn't a research phase, so it should also be
possible to estimate with a fixed bid. The success metric here is to
get to 10% accuracy -- meaning the text returned from only 10 of the
100 test receipts contains an accurate merchant name, date, and amount.
(It's ok if there is also a bunch of other garbage text returned; just
so long as the correct answers are in there somewhere.) Also include
in this bid the construction of a test harness that simply loops
across the 100 input images and tests the scanning result for the
presence of the correct strings for each.
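
A test harness along those lines could be as small as the Python sketch below. The shape of `cases` (pairs of image path and expected strings), the `scan` callable, and the case-insensitive matching are assumptions for illustration, not requirements from the posting.

```python
def run_harness(cases, scan):
    """Scan every test image and report which ones contain all expected strings.

    cases: list of (image_path, expected_strings) pairs.
    scan:  callable taking an image path and returning the recognized text.
    Returns (accuracy, failed_paths).
    """
    failed = []
    for image_path, expected_strings in cases:
        text = scan(image_path).lower()
        # Extra garbage text is fine; every expected string just has to appear.
        # Assumption: matching is case-insensitive.
        if not all(s.lower() in text for s in expected_strings):
            failed.append(image_path)
    accuracy = (len(cases) - len(failed)) / len(cases) if cases else 0.0
    return accuracy, failed
```

With 100 cases, an accuracy of 0.10 or higher would meet the milestone described above.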

3) An hourly rate to improve this success rate over time, along with
an estimate of how many hours/wk you can dedicate to this project (50
is the goal, but less is ok too). This is where you'll be pushed to
try new things, write new code, etc. This is where fixed-estimation
becomes impossible, so hourly makes sense.

We'll use Elance for payment, Github for collaboration. This code will
execute serverside so CPU is aplenty (mobile on-device is a very
remote stretch goal).

Sound good? Thanks!

-david

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
