Sent from my iPad
On 25 Sep, 2011, at 11:51, David Barrett <[email protected]> wrote:
> Hi! I'm looking to hire some OCR contractors, are you interested?
> Any code changes you make to Tesseract can be freely contributed back
> to the mainline, so any Tesseract contributors looking to get paid to
> develop the platform -- come talk to me. More information below.
> Thanks!
>
> -david
>
> http://www.elance.com/j/read-text-from-mobile-camera-images-serverside-ocr-of-mobile-images/26253541/
>
> Read text from mobile camera images (serverside OCR of mobile images)
>
> Hi! Expensify process hundreds of thousands of receipt images every
> month. We have a proprietary scanning system in place that reads
> receipt images, but we're not scanning experts and are eager for
> someone with more experience to take another look. The project is:
>
> 1) We'll give you a large corpus of receipt images taken under a wide
> variety of lighting conditions, across every mobile device under the
> sun.
>
> 2) You create a PHP function (probably calling out to some native
> binary or library) that can execute on 64-bit Ubuntu 10.04. It accepts
> one parameter -- a filename to an image on disk in JPG form.
>
> 3) This function returns the text obtained from the receipt image as a
> simple unstructured string -- just all the numbers and letters read
> off of it. (The eventual goal is to "structure" the response to
> identify what is the merchant, amount, and date of the purchase, but
> scanning *anything* is the start.)
>
> I *imagine* you'll want to do something like:
>
> 1) Edge detect to find the receipt
> 2) De-warp into a receipt shape
> 3) Reduce to grayscale or b/w (whitebalanced color is nice, but not
> necessary)
> 4) Orient correctly (or just scan 4 times, once for each orientation,
> and take the best result)
> 5) OCR
>
> However, that's just a suggestion -- how you make this happen is
> entirely up to you.
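The "scan 4 times, once for each orientation, and take the best result" option in step 4 can be sketched roughly as below. This is only an illustration in Python, not the PHP function the posting asks for. The names `best_orientation`, `text_score`, `rotate` and `ocr` are hypothetical, and "best" is assumed here to mean "most letters and digits read", which the posting does not specify.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def text_score(text: str) -> int:
    """Assumed quality measure: how many letters and digits were read."""
    return sum(ch.isalnum() for ch in text)


def best_orientation(
    image: Image,
    rotate: Callable[[Image, int], Image],
    ocr: Callable[[Image], str],
    angles: Sequence[int] = (0, 90, 180, 270),
) -> Tuple[int, str]:
    """Run OCR once per rotation and return (angle, text) with the top score.

    `rotate` and `ocr` are supplied by the caller, so any image library
    or OCR engine can be plugged in. Ties go to the earliest angle.
    """
    results = [(angle, ocr(rotate(image, angle))) for angle in angles]
    return max(results, key=lambda item: text_score(item[1]))
```

Counting alphanumeric characters is a crude proxy. If the OCR engine in use reports a confidence value, that would probably be a better score to maximise.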
> If it involves open source libraries, licensing
> proprietary libraries, writing new code from scratch, whatever.
> Ideally we'd do this at zero marginal cost (eg, if we do license
> anything it'd be for a fixed cost), but marginal cost is acceptable if
> that's what it takes.
>
> Please include in your bid the following milestones:
>
> 0) A general plan of attack. List what libraries you intend to use,
> what experience you've had in the past doing this sort of work, what
> you've found works well in general and what problems you've
> encountered, etc. Don't worry about us stealing your ideas: if your
> ideas are the best then we're obviously going to go with you. This is
> a very, very hard problem and we're genuinely looking for a long-term
> relationship (contract at the very least, possibly full-time
> contracting, or even permanent hire if you're open). This is where you
> demonstrate your a priori knowledge of this space.
>
> 1) A first fixed bid (the one you tell Elance) for the "image cleanup"
> portion (eg, steps 1-3 above). This is pretty self-contained and
> straightforward, so it should be possible to estimate its cost in a
> fixed manner. We'll provide you with 100 real-world receipt images (or
> more if you need): the success metric will be at least 95% success in
> cleaning up these images for OCR, as judged through visual inspection.
>
> 2) A second fixed bid (just include in the comments) for the "first
> pass" of your OCR approach. In essence, do everything you already know
> how to do -- apply every trick in your book that you've already tried
> in the past. This isn't a research phase, so it should also be
> possible to estimate with a fixed bid. The success metric here is to
> get to 10% accuracy -- meaning the text returned from only 10 of the
> 100 test receipt contains an accurate merchant name, date, and amount.
> (It's ok if there is also a bunch of other garbage text returned; just
> so long as the correct answers are in there somewhere.) Also include
> in this bid the construction of a test harness that simply loops
> across the 100 input images and tests the scanning result for the
> presence of the correct strings for each.
>
> 3) An hourly rate to improve this success rate over time, along with
> an estimate of how many hours/wk you can dedicate to this project (50
> is the goal, but less is ok too). This is where you'll be pushed to
> try new things, write new code, etc. This is where fixed-estimation
> becomes impossible, so hourly makes sense.
>
> We'll use Elance for payment, Github for collaboration. This code will
> execute serverside so CPU is aplenty (mobile on-device is a very
> remote stretch goal).
>
> Sound good? Thanks!
>
> -david
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
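The test harness described in milestone 2 (loop across the input images, check each scan result for the presence of the correct strings) could look roughly like this. It is a minimal Python sketch, not the PHP deliverable. `Expected`, `receipt_passes`, `run_harness` and the `scan` callable are hypothetical names, and case-insensitive substring matching is an assumption the posting does not state.

```python
from typing import Callable, Iterable, NamedTuple


class Expected(NamedTuple):
    """Ground truth for one receipt image (hand-labelled, by assumption)."""
    filename: str
    merchant: str
    date: str
    amount: str


def receipt_passes(text: str, expected: Expected) -> bool:
    """True if merchant, date and amount all occur somewhere in the text.

    Extra garbage text is allowed. Matching ignores case (an assumption).
    """
    haystack = text.lower()
    wanted = (expected.merchant, expected.date, expected.amount)
    return all(value.lower() in haystack for value in wanted)


def run_harness(cases: Iterable[Expected], scan: Callable[[str], str]) -> float:
    """Scan every image and return the fraction of receipts that pass."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = sum(receipt_passes(scan(case.filename), case) for case in cases)
    return passed / len(cases)
```

With 100 labelled receipts, the 10% target in the posting would correspond to `run_harness(cases, scan) >= 0.10`, where `scan` wraps whatever function takes a JPG filename and returns the unstructured text.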

