Sent from my iPad

On 25 Sep, 2011, at 11:51, David Barrett <[email protected]> wrote:

> Hi!  I'm looking to hire some OCR contractors, are you interested?
> Any code changes you make to Tesseract can be freely contributed back
> to the mainline, so any Tesseract contributors looking to get paid to
> develop the platform -- come talk to me.  More information below.
> Thanks!
> 
> -david
> 
> http://www.elance.com/j/read-text-from-mobile-camera-images-serverside-ocr-of-mobile-images/26253541/
> 
> Read text from mobile camera images (serverside OCR of mobile images)
> 
> Hi! Expensify processes hundreds of thousands of receipt images every
> month. We have a proprietary scanning system in place that reads
> receipt images, but we're not scanning experts and are eager for
> someone with more experience to take another look. The project is:
> 
> 1) We'll give you a large corpus of receipt images taken under a wide
> variety of lighting conditions, across every mobile device under the
> sun.
> 
> 2) You create a PHP function (probably calling out to some native
> binary or library) that can execute on 64-bit Ubuntu 10.04. It accepts
> one parameter -- a filename to an image on disk in JPG form.
> 
> 3) This function returns the text obtained from the receipt image as a
> simple unstructured string -- just all the numbers and letters read
> off of it. (The eventual goal is to "structure" the response to
> identify what is the merchant, amount, and date of the purchase, but
> scanning *anything* is the start.)
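For what it's worth, the "one filename in, one unstructured string out" contract in 2) and 3) could look roughly like the sketch below. It is in Python rather than the PHP the posting asks for, and it assumes the native binary is the `tesseract` command-line tool, invoked as `tesseract <image> stdout` and available on the PATH. The names `build_command`, `run_binary`, and `read_receipt_text` are made up for illustration.

```python
import subprocess
from typing import Callable, List


def build_command(image_path: str) -> List[str]:
    """Build the command line for the native OCR binary.

    Assumption: the binary is the `tesseract` CLI, where an output base of
    `stdout` makes it print the recognized text instead of writing a file.
    """
    return ["tesseract", image_path, "stdout"]


def run_binary(command: List[str]) -> str:
    """Run the command and return whatever it printed on standard output."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def read_receipt_text(
    image_path: str,
    run: Callable[[List[str]], str] = run_binary,
) -> str:
    """Take a path to a JPG on disk and return the raw, unstructured text."""
    return run(build_command(image_path)).strip()
```

The `run` parameter is only there so the wrapper can be exercised without the binary installed; a real caller would just use `read_receipt_text("receipt.jpg")`.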
> 
> I *imagine* you'll want to do something like:
> 
> 1) Edge detect to find the receipt
> 2) De-warp into a receipt shape
> 3) Reduce to grayscale or b/w (white-balanced color is nice, but not
> necessary)
> 4) Orient correctly (or just scan 4 times, once for each orientation,
> and take the best result)
> 5) OCR
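The fallback in step 4 (scan once per orientation and keep the best result) might be sketched as below. This is only an illustration under stated assumptions: the image is a plain list of pixel rows rather than a real image type, the OCR engine is passed in as a function, and "best" is taken to mean "most letters and digits", which is a crude guess at a quality score and not anything the posting specifies.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real image type: rows of grayscale pixel values.
Image = List[List[int]]


def rotate_90_clockwise(image: Image) -> Image:
    """Rotate a row-major pixel grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def count_alphanumeric(text: str) -> float:
    """Crude quality score (an assumption): more letters and digits is better."""
    return float(sum(ch.isalnum() for ch in text))


def best_orientation(
    image: Image,
    ocr: Callable[[Image], str],
    score: Callable[[str], float] = count_alphanumeric,
) -> Tuple[int, str]:
    """Run OCR at 0, 90, 180 and 270 degrees; return (degrees, text) of the best."""
    best_degrees, best_text, best_score = 0, "", float("-inf")
    current = image
    for degrees in (0, 90, 180, 270):
        text = ocr(current)
        text_score = score(text)
        # Strict comparison keeps the earliest orientation when scores tie.
        if text_score > best_score:
            best_degrees, best_text, best_score = degrees, text, text_score
        current = rotate_90_clockwise(current)
    return best_degrees, best_text
```

The cost is four OCR passes per receipt, which seems acceptable given that the code runs serverside.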
> 
> However, that's just a suggestion -- how you make this happen is
> entirely up to you. If it involves open source libraries, licensing
> proprietary libraries, writing new code from scratch, whatever.
> Ideally we'd do this at zero marginal cost (eg, if we do license
> anything it'd be for a fixed cost), but marginal cost is acceptable if
> that's what it takes.
> 
> Please include in your bid the following milestones:
> 
> 0) A general plan of attack. List what libraries you intend to use,
> what experience you've had in the past doing this sort of work, what
> you've found works well in general and what problems you've
> encountered, etc. Don't worry about us stealing your ideas: if your
> ideas are the best then we're obviously going to go with you. This is
> a very, very hard problem and we're genuinely looking for a long-term
> relationship (contract at the very least, possibly full-time
> contracting, or even permanent hire if you're open). This is where you
> demonstrate your a priori knowledge of this space.
> 
> 1) A first fixed bid (the one you tell Elance) for the "image cleanup"
> portion (eg, steps 1-3 above). This is pretty self-contained and
> straightforward, so it should be possible to estimate its cost in a
> fixed manner. We'll provide you with 100 real-world receipt images (or
> more if you need): the success metric will be at least 95% success in
> cleaning up these images for OCR, as judged through visual inspection.
> 
> 2) A second fixed bid (just include in the comments) for the "first
> pass" of your OCR approach. In essence, do everything you already know
> how to do -- apply every trick in your book that you've already tried
> in the past. This isn't a research phase, so it should also be
> possible to estimate with a fixed bid. The success metric here is to
> get to 10% accuracy -- meaning the text returned from only 10 of the
> 100 test receipts contains an accurate merchant name, date, and amount.
> (It's ok if there is also a bunch of other garbage text returned; just
> so long as the correct answers are in there somewhere.) Also include
> in this bid the construction of a test harness that simply loops
> across the 100 input images and tests the scanning result for the
> presence of the correct strings for each.
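The test harness described here could be as small as the following sketch. It assumes the expected answers are kept as a mapping from image filename to a (merchant, date, amount) tuple of strings, and that a receipt counts as a success only when all three strings appear somewhere in the returned text, garbage included. The names `receipt_passes` and `run_harness` and the data layout are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Assumed layout: filename -> (merchant, date, amount), all as plain strings.
Expected = Dict[str, Tuple[str, str, str]]


def receipt_passes(text: str, expected_fields: Iterable[str]) -> bool:
    """A receipt passes when every expected string appears somewhere in the text."""
    return all(field in text for field in expected_fields)


def run_harness(scan: Callable[[str], str], expected: Expected) -> float:
    """Scan every image and return the fraction whose text has all three fields."""
    if not expected:
        return 0.0
    passed = 0
    for filename, fields in expected.items():
        try:
            text = scan(filename)
        except Exception:
            # A scan that crashes is counted as a failed receipt, not a failed run.
            text = ""
        if receipt_passes(text, fields):
            passed += 1
    return passed / len(expected)
```

With 100 test receipts, the 10% milestone would correspond to `run_harness(...)` returning at least 0.10.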
> 
> 3) An hourly rate to improve this success rate over time, along with
> an estimate of how many hours/wk you can dedicate to this project (50
> is the goal, but less is ok too). This is where you'll be pushed to
> try new things, write new code, etc. This is where fixed-estimation
> becomes impossible, so hourly makes sense.
> 
> We'll use Elance for payment, Github for collaboration. This code will
> execute serverside so CPU is aplenty (mobile on-device is a very
> remote stretch goal).
> 
> Sound good? Thanks!
> 
> -david
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

