Hi! I'm looking to hire some OCR contractors; are you interested? Any code changes you make to Tesseract can be freely contributed back to the mainline, so any Tesseract contributors looking to get paid to develop the platform -- come talk to me. More information below. Thanks!
-david

http://www.elance.com/j/read-text-from-mobile-camera-images-serverside-ocr-of-mobile-images/26253541/

Read text from mobile camera images (serverside OCR of mobile images)

Hi! Expensify processes hundreds of thousands of receipt images every month. We have a proprietary scanning system in place that reads receipt images, but we're not scanning experts and are eager for someone with more experience to take another look.

The project is:

1) We'll give you a large corpus of receipt images taken under a wide variety of lighting conditions, across every mobile device under the sun.
2) You create a PHP function (probably calling out to some native binary or library) that can execute on 64-bit Ubuntu 10.04. It accepts one parameter -- the filename of a JPG image on disk.
3) This function returns the text obtained from the receipt image as a simple unstructured string -- just all the numbers and letters read off of it. (The eventual goal is to "structure" the response to identify the merchant, amount, and date of the purchase, but scanning *anything* is the start.)

I *imagine* you'll want to do something like:

1) Edge detect to find the receipt
2) De-warp into a receipt shape
3) Reduce to grayscale or b/w (white-balanced color is nice, but not necessary)
4) Orient correctly (or just scan 4 times, once for each orientation, and take the best result)
5) OCR

However, that's just a suggestion -- how you make this happen is entirely up to you: open source libraries, licensing proprietary libraries, writing new code from scratch, whatever. Ideally we'd do this at zero marginal cost (e.g., if we do license anything it'd be for a fixed cost), but marginal cost is acceptable if that's what it takes.

Please include in your bid the following milestones:

0) A general plan of attack.
List what libraries you intend to use, what experience you've had in the past doing this sort of work, what you've found works well in general, what problems you've encountered, etc. Don't worry about us stealing your ideas: if your ideas are the best then we're obviously going to go with you. This is a very, very hard problem and we're genuinely looking for a long-term relationship (contract at the very least, possibly full-time contracting, or even permanent hire if you're open). This is where you demonstrate your a priori knowledge of this space.

1) A first fixed bid (the one you tell Elance) for the "image cleanup" portion (e.g., steps 1-3 above). This is pretty self-contained and straightforward, so it should be possible to estimate its cost in a fixed manner. We'll provide you with 100 real-world receipt images (or more if you need): the success metric will be at least 95% success in cleaning up these images for OCR, as judged by visual inspection.

2) A second fixed bid (just include in the comments) for the "first pass" of your OCR approach. In essence, do everything you already know how to do -- apply every trick in your book that you've already tried in the past. This isn't a research phase, so it should also be possible to estimate with a fixed bid. The success metric here is to get to 10% accuracy -- meaning the text returned from only 10 of the 100 test receipts contains an accurate merchant name, date, and amount. (It's ok if there is also a bunch of other garbage text returned, just so long as the correct answers are in there somewhere.) Also include in this bid the construction of a test harness that simply loops across the 100 input images and tests the scanning result for the presence of the correct strings for each.

3) An hourly rate to improve this success rate over time, along with an estimate of how many hours/wk you can dedicate to this project (50 is the goal, but less is ok too).
This is where you'll be pushed to try new things, write new code, etc. This is where fixed estimation becomes impossible, so hourly makes sense.

We'll use Elance for payment, GitHub for collaboration. This code will execute serverside so CPU is plentiful (mobile on-device is a very remote stretch goal).

Sound good? Thanks!

-david

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
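
For reference, the test harness described in milestone 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not a required design: it is written in Python rather than the PHP the deliverable calls for, and the names `normalize`, `receipt_passes`, `run_harness`, the `scan` callable, and the `(path, expected strings)` case layout are all hypothetical. It counts a receipt as a success when the merchant, date, and amount strings all appear somewhere in the returned text, ignoring case, whitespace, and any surrounding garbage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical case layout: (image path, {"merchant": ..., "date": ..., "amount": ...}).
Case = Tuple[str, Dict[str, str]]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks in OCR output don't cause misses."""
    return " ".join(text.lower().split())


def receipt_passes(ocr_text: str, expected: Dict[str, str]) -> bool:
    """True if every expected string appears somewhere in the text; extra garbage is fine."""
    haystack = normalize(ocr_text)
    return all(normalize(value) in haystack for value in expected.values())


def run_harness(cases: List[Case], scan: Callable[[str], str]) -> float:
    """Scan every image and return the fraction whose text contains all expected strings."""
    passed = 0
    for path, expected in cases:
        try:
            text = scan(path)
        except Exception:
            text = ""  # a crash on one image counts as a miss, not a harness failure
        if receipt_passes(text, expected):
            passed += 1
    return passed / len(cases) if cases else 0.0
```

Here `scan` stands in for whatever function does the actual OCR; passing it in as a parameter lets the same loop be run against a stub while the real scanner is still being written. A return value of 0.10 or higher over the 100 test receipts would correspond to the 10% target.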

